Introduction

For this project I created my own dataset. Using Python with the Yahoo Finance API, I scraped the constituent lists of the S&P 500, the Russell 2000, and the S&P 400 to get a mix of small-, mid-, and large-cap stocks. I then selected 1000 companies at random from this aggregated list and used the Yahoo Finance and Finnhub APIs to pull each company's financial data into a CSV. Github Link to process: .

The objective of this project is to develop a predictive model that estimates the Recent Close Price of stocks from various market and financial variables. I hope to answer the following questions:

1. What are the most significant predictors of stock prices?
2. How do market variables interact to influence stock prices? I want to see which variables have interaction effects.
3. Can we predict stock price trends accurately? I want to see whether the model achieves a high enough R^2 that historical data can be used to predict current prices.
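For question 3, R^2 can be computed by hand as one minus the ratio of residual to total sum of squares. A standalone sketch on made-up data (the toy variables here are illustrative, not from my dataset):

```r
# R^2 = 1 - SSE/SST, checked against lm()'s own summary on toy data
set.seed(7)
x <- rnorm(100)
y <- 3 * x + rnorm(100)                 # made-up linear relationship
m <- lm(y ~ x)
r2 <- 1 - sum(residuals(m)^2) / sum((y - mean(y))^2)
all.equal(r2, summary(m)$r.squared)     # TRUE
```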

By developing and validating this model, the goal is to generate insights into the economic drivers of stock prices, which can potentially inform investment strategies or financial decision-making.

Description of Variables

Ticker: Identifier for the stock (not used as a predictor).
Sector: Categorical variable indicating the sector.
Industry: Categorical variable indicating the industry.
Region: Categorical variable indicating the region (not used as a predictor because it is "North America" for all rows).
Market Cap Classification: Categorical variable indicating the market cap classification.
Volatility Classification: Categorical variable indicating the volatility classification.
Growth vs Value: Categorical variable indicating growth or value classification.
P/E Ratio: Quantitative variable representing the price-to-earnings ratio.
Dividend Yield (%): Quantitative variable representing the dividend yield.
Beta: Quantitative variable representing the stock's volatility relative to the market.
Avg Volume: Quantitative variable representing the average trading volume.
Recent.Close.Price: Quantitative variable representing the most recent closing price (target).
EPS: Quantitative variable indicating the portion of a company's profit allocated to each share.
Revenue: Quantitative variable indicating the company's revenue.
Net Income: Quantitative variable indicating the net profit.
Debt-to-Equity Ratio: Quantitative variable comparing a company's total liabilities to its equity.
ROE: Quantitative variable measuring how much profit a company generates with shareholder money.
P/B Ratio: Quantitative variable comparing a company's market value to its book value.
Free Cash Flow: Quantitative variable representing the cash a company generates after operating and capital expenditures.
Analyst Ratings: Categorical variable from the Finnhub API, based on ratings from analysts.
Insider Transactions: Categorical variable (Finnhub API) indicating insider transactions.

R Libraries and Helper Functions

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(tidyr)
library(dplyr)
library(Boruta)
library(doParallel)
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## 
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## 
## Loading required package: iterators
## Loading required package: parallel
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(olsrr)
## 
## Attaching package: 'olsrr'
## 
## The following object is masked from 'package:datasets':
## 
##     rivers
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:olsrr':
## 
##     cement
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(margins)
library(corrplot)
## corrplot 0.95 loaded
library(effects)
## Use the command
##     lattice::trellis.par.set(effectsTheme())
##   to customize lattice options for effects plots.
## See ?effectsTheme for details.
# I like using helper functions because they make my code more modular, and that's how I write my CS programs
# this calculates the optimal number of histogram bins via Sturges' rule
helper.calculate_bins_sturges <- function(data) {
  n <- nrow(data)
  bins <- ceiling(log2(n) + 1)
  return(bins)
}
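As a standalone sanity check of Sturges' rule used above (the 426-row count is hypothetical, and `sturges_bins` is a local re-implementation so this sketch runs on its own):

```r
# Sturges' rule: bins = ceiling(log2(n) + 1)
sturges_bins <- function(n) ceiling(log2(n) + 1)

sturges_bins(426)   # log2(426) is about 8.73, so ceiling(9.73) = 10 bins
sturges_bins(100)   # log2(100) is about 6.64, so ceiling(7.64) = 8 bins
```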


# helper that plots either a single-variable histogram or many variables from the same dataset via facet_wrap
helper.plot_histograms <- function(data, color = "blue", bins = NULL, variable_name = NULL) {
  if (is.null(bins)) {
    bins <- helper.calculate_bins_sturges(data)
  }
  if (!is.null(variable_name)) {
    ggplot(data, aes(x = .data[[variable_name]])) + 
      geom_histogram(bins = bins, fill = color, alpha = 0.7) + 
      theme_minimal() + 
      labs(title = paste("Distribution of", variable_name))
  } else {
    # validate input for multi-variable mode
    if (!all(c("Variable", "Value") %in% names(data))) {
      stop("Data must contain 'Variable' and 'Value' columns for multi-variable plotting.")
    }
    ggplot(data, aes(x = Value)) + 
      geom_histogram(bins = bins, fill = color, alpha = 0.7) + 
      facet_wrap(~ Variable, scales = "free") + 
      theme_minimal() + 
      labs(title = "Distribution of Predictors")
  }
}


# basic box plot helper using ggplot
helper.create_boxplot <- function(data, x_var, y_var = NULL, fill_color = "blue", title_prefix = "Latest Closing") {
  if (!is.null(y_var)) {
    ggplot(data, aes_string(x = x_var, y = y_var)) +
      geom_boxplot(fill = fill_color, alpha = 0.7) +
      theme_minimal() +
      labs(title = paste(title_prefix, y_var, "by", x_var), x = x_var, y = y_var)
  } else {
    return (Boxplot(as.formula(paste("~", x_var)), data = data, col = fill_color))
  }
}

# helper function for summary for categorical/factor variables
helper.group_summary <- function(data, group_var, target_var) {
  data %>%
    group_by(across(all_of(group_var))) %>%
    summarise(
      mean = mean(.data[[target_var]], na.rm = TRUE),
      sd = sd(.data[[target_var]], na.rm = TRUE),
      n = n()
    ) %>%
    arrange(desc(mean))
}

# helper function for ANOVA for basic EDA of factor variables
helper.run_anova <- function(data, factor_var, target_var) {
  formula <- as.formula(paste(target_var, "~", factor_var)) #s/o CS131
  anova_result <- aov(formula, data = data)
  summary(anova_result)
}

helper.eda_factor <- function(data, factor_var, target_var, color){
  #this gives the boxplot
  print(helper.create_boxplot(data, factor_var, target_var, color))
  
  # now ANOVA
  print(helper.run_anova(data, factor_var, target_var))
  
  #finally summary
  helper.group_summary(data, factor_var, target_var)
}

helper.analyze_variable <- function(data, var_name, dependent_var = "Recent.Close.Price") {

  # 1. Histogram
  print(helper.plot_histograms(data = data, variable_name = var_name))
  
  # 2. QQ Plot
  p2 <- ggplot(data, aes(sample = .data[[var_name]])) +
    stat_qq() +
    stat_qq_line(color = "red") +
    theme_minimal() +
    labs(title = paste("Q-Q Plot of", var_name))
  print(p2)
  
  # 3. Box Plot
  p3 <- helper.create_boxplot(data = data, x_var = var_name)
  
  # 4. Density Plot
  p5 <- ggplot(data, aes_string(x = var_name)) +
        geom_density(fill = "lightblue", alpha = 0.7) +
        stat_function(fun = dnorm, 
                      args = list(mean = mean(data[[var_name]], na.rm = TRUE),
                                  sd = sd(data[[var_name]], na.rm = TRUE)),
                      color = "red", linewidth = 1) +
        theme_minimal() +
        labs(title = paste("Density Plot of", var_name),
             x = var_name,
             y = "Density")
  print(p5)

  
  # Print summary statistics
  cat("\nSummary Statistics for", var_name, ":\n")
  print(summary(data[[var_name]]))
}


detect_nonlinearities_and_outliers <- function(data, dependent_var, predictor_var) {
  # Helper: Perform power transformation and return lambda
  get_power_transform <- function(data, predictor_var) {
    formula <- as.formula(paste(predictor_var, "~", 1))
    power_transform <- powerTransform(formula, data = data)
    lambda <- power_transform$lambda
    return(lambda)
  }
  
  # Helper: Detect influential outliers using Cook's distance
  detect_outliers <- function(model, data) {
    cooks_d <- cooks.distance(model)
    threshold <- 4 * mean(cooks_d, na.rm = TRUE)
    influential_points <- which(cooks_d > threshold)
    outliers <- data[influential_points, ]
    return(list(outliers = outliers, cooks_d = cooks_d))
  }
  
  # Step 1: Power Transformation
  lambda <- get_power_transform(data, predictor_var)
  transformed_var <- paste0(predictor_var, "_transformed")
  data[[transformed_var]] <- bcPower(data[[predictor_var]], lambda)
  
  # Step 2: Fit the base and transformed models (the outlier-free model follows)
  model1 <- lm(as.formula(paste(dependent_var, "~", predictor_var)), data = data)
  model2 <- lm(as.formula(paste(dependent_var, "~", transformed_var)), data = data)
  
  # Detect outliers for Model 1
  outlier_info <- detect_outliers(model1, data)
  data_without_outliers <- anti_join(data, outlier_info$outliers, by = names(data))
  model3 <- lm(as.formula(paste(dependent_var, "~", predictor_var)), data = data_without_outliers)
  
  # Step 3: Summarize models
  model_summaries <- list(
    base_model = summary(model1),
    transformed_model = summary(model2),
    model_without_outliers = summary(model3)
  )
  
  # Step 4: Visualization
  par(mfrow = c(2, 2))  # Set up 2x2 plotting area
  
  # Scatterplot: Original vs. Transformed Predictor
  plot(data[[predictor_var]], data[[transformed_var]], 
       main = "Original vs Transformed Predictor", 
       xlab = predictor_var, 
       ylab = transformed_var)
  
  # Residual plots for Model 1
  plot(residuals(model1), main = "Residuals of Base Model", xlab = "Index", ylab = "Residuals")
  
  # Residual plots for Model 3 (without outliers)
  plot(residuals(model3), main = "Residuals of Model w/o Outliers", xlab = "Index", ylab = "Residuals")
  
  # Cook's distance plot
  plot(outlier_info$cooks_d, type = "h", main = "Cook's Distance", 
       xlab = "Index", ylab = "Cook's Distance")
  abline(h = 4 * mean(outlier_info$cooks_d, na.rm = TRUE), col = "red", lty = 2)  # match the 4x-mean threshold
  
  
  # Plot the models themselves
  par(mfrow = c(2, 2))  # Reset plotting area
  plot(model1, main = "Base Model")
  plot(model2, main = "Transformed Model")
  plot(model3, main = "Model without Outliers")
  
  # Scatterplots for each model
  p1 <- ggplot(data, aes_string(x = predictor_var, y = dependent_var)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", color = "blue", se = TRUE) +
    theme_minimal() +
    labs(title = "Base Model",
         x = predictor_var,
         y = dependent_var)
  
  p2 <- ggplot(data, aes_string(x = transformed_var, y = dependent_var)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", color = "blue", se = TRUE) +
    theme_minimal() +
    labs(title = "Transformed Model",
         x = transformed_var,
         y = dependent_var)
  
  p3 <- ggplot(data_without_outliers, aes_string(x = predictor_var, y = dependent_var)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", color = "blue", se = TRUE) +
    theme_minimal() +
    labs(title = "Model without Outliers",
         x = predictor_var,
         y = dependent_var)
  
  gridExtra::grid.arrange(p1, p2, p3, ncol = 1)
  
  # Step 5: Return results
  results <- list(
    lambda = lambda,
    model_summaries = model_summaries,
    influential_outliers = outlier_info$outliers,
    data_without_outliers = data_without_outliers
  )
  
  return(results)
}
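Step 1 of the function above relies on car::bcPower, which applies the Box-Cox family: (x^lambda - 1)/lambda for lambda != 0, and log(x) at lambda = 0. A standalone sketch of that transform, reimplemented by hand so it runs without any packages (`bc_power` is my own illustrative name, not the car function):

```r
# Box-Cox power family (hand-rolled sketch of what car::bcPower computes)
bc_power <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

bc_power(c(1, 2, 5, 10), 0.17)   # mildly compresses large values
bc_power(c(1, 2, 5, 10), 0)      # reduces to log(x)
```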


helper.remove_specific_rows <- function(data, var1_pattern, var2_pattern) {
  # Create logical index for rows to keep
  keep_rows <- !(grepl(var1_pattern, data$var1) & grepl(var2_pattern, data$var2) |
                 grepl(var1_pattern, data$var2) & grepl(var2_pattern, data$var1))
  
  # Subset the data to keep only the desired rows
  filtered_data <- data[keep_rows, ]
  
  return(filtered_data)
}

helper.diagnostic_plots <- function(model) {
  par(mfrow = c(2, 2))  # Set up a 2x2 plotting layout
  
  # 1. Cook's Distance Plot
  plot(model, which = 4, main = "Cook's Distance Plot")
  abline(h = 4/length(model$residuals), col = "red", lty = 2)  # Add threshold line
  legend("topright", legend = "Threshold: 4/n", col = "red", lty = 2, bty = "n")
  
  # 2. Residuals vs Fitted Plot
  plot(model$fitted.values, model$residuals, 
       main = "Residuals vs Fitted",
       xlab = "Fitted Values", ylab = "Residuals", pch = 20, col = "blue")
  abline(h = 0, col = "red", lty = 2)
  
  # 3. QQ-Plot
  qqnorm(model$residuals, main = "QQ-Plot")
  qqline(model$residuals, col = "red", lty = 2)
  
}
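The 4/n cutoff drawn by the diagnostics above is a common rule of thumb for Cook's distance. A standalone sketch on toy data with one deliberately planted influential point:

```r
# Flag observations whose Cook's distance exceeds 4/n
set.seed(42)
x <- rnorm(50)
y <- x + rnorm(50)
y[50] <- y[50] + 10            # plant one influential point
m <- lm(y ~ x)
d <- cooks.distance(m)
which(d > 4 / length(d))       # observation 50 is among those flagged
```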

Data Preprocessing and Set-up

stock_data <- read.csv("enhanced_stock_dataset.csv")
na_counts_base <- colSums(is.na(stock_data))
print(na_counts_base)
##                    Ticker                    Sector                  Industry 
##                         0                         0                         0 
##                    Region                Market.Cap Market.Cap.Classification 
##                         0                         0                         0 
## Volatility.Classification           Growth.vs.Value                 P.E.Ratio 
##                         0                         0                       140 
##        Dividend.Yield....             X52.Week.High              X52.Week.Low 
##                         0                         0                         0 
##                      Beta                Avg.Volume        Recent.Close.Price 
##                        11                         0                         0 
##                       EPS                   Revenue                Net.Income 
##                         0                         2                         0 
##      Debt.to.Equity.Ratio                       ROE                 P.B.Ratio 
##                        99                        38                        38 
##            Free.Cash.Flow           Analyst.Ratings      Insider.Transactions 
##                        82                         0                         0
stock_data <- na.omit(stock_data)

# double-check that no missing values remain after na.omit
if (sum(is.na(stock_data)) > 0) {
  print("missing values present in data")
}
stock_data$X52.Week.High <- NULL
stock_data$X52.Week.Low <- NULL
stock_data$Market.Cap <- NULL
helper.plot_histograms( data = stock_data, color = "green", variable_name = "Recent.Close.Price")

1. Variable Selection

Boruta Algorithm Selection

set.seed(123)
# Boruta runs single-threaded in R by default, so set up a parallel backend manually -- I am not waiting around for the Boruta analysis
cl <- makeCluster(detectCores() - 1) # Use all but one core
registerDoParallel(cl)

boruta_result_stock <- Boruta(Recent.Close.Price ~ ., data = stock_data, doTrace = 2)
## After 11 iterations, +2.6 secs:
##  confirmed 11 attributes: Avg.Volume, Dividend.Yield...., EPS, Free.Cash.Flow, Growth.vs.Value and 6 more;
##  rejected 3 attributes: Analyst.Ratings, Insider.Transactions, Region;
##  still have 6 attributes left.
## After 15 iterations, +3.4 secs:
##  rejected 1 attribute: Ticker;
##  still have 5 attributes left.
## After 19 iterations, +4.4 secs:
##  rejected 2 attributes: Industry, Volatility.Classification;
##  still have 3 attributes left.
## After 28 iterations, +6.1 secs:
##  rejected 1 attribute: Sector;
##  still have 2 attributes left.
## After 76 iterations, +15 secs:
##  confirmed 1 attribute: Beta;
##  still have 1 attribute left.
## After 83 iterations, +17 secs:
##  confirmed 1 attribute: Debt.to.Equity.Ratio;
##  no more attributes left.
#stop cluster 
stopCluster(cl)
registerDoSEQ()

plot(boruta_result_stock, xlab = "", xaxt = "n", main = "Variable Importance via Boruta")
lz_stock <- lapply(1:ncol(boruta_result_stock$ImpHistory), function(i)
  boruta_result_stock$ImpHistory[is.finite(boruta_result_stock$ImpHistory[, i]), i])
names(lz_stock) <- colnames(boruta_result_stock$ImpHistory)
Labels_stock <- sort(sapply(lz_stock, median))
axis(side = 1, las = 2, labels = names(Labels_stock),
     at = 1:ncol(boruta_result_stock$ImpHistory), cex.axis = 0.7)

important_vars_stock <- getSelectedAttributes(boruta_result_stock, withTentative = TRUE)
importance_scores_stock <- attStats(boruta_result_stock)


sorted_scores_stock <- importance_scores_stock[order(-importance_scores_stock$meanImp), ]
print(sorted_scores_stock)
##                               meanImp   medianImp     minImp    maxImp
## EPS                       25.01231477 25.15203141 21.4834585 27.842510
## P.E.Ratio                 11.02554761 11.21769856  6.8088376 13.350420
## P.B.Ratio                 10.89676237 10.86590761  7.9868108 14.105185
## Avg.Volume                 8.73858498  8.77847636  7.0982706 10.599089
## Market.Cap.Classification  8.30883087  8.32893204  5.5445765 11.097122
## Net.Income                 7.49448151  7.49718636  5.4265206  9.397697
## Revenue                    6.36979998  6.35140267  3.6918324  7.941068
## Dividend.Yield....         5.93230512  5.97045699  3.0633932  8.262782
## Growth.vs.Value            5.90310185  5.92254450  3.3886229  8.133442
## ROE                        5.45336923  5.43734902  3.5465118  7.829754
## Free.Cash.Flow             5.11396132  5.06443586  3.5092127  6.546770
## Debt.to.Equity.Ratio       2.59059851  2.60739645  0.1912753  5.057399
## Beta                       2.52598841  2.58644623  0.5517994  4.773966
## Sector                     0.92621503  0.92198067 -0.8894823  2.854915
## Insider.Transactions       0.75721628  0.99035980 -1.3162031  1.732280
## Volatility.Classification  0.61263422  0.45712574 -0.9238392  3.085234
## Industry                   0.57789013  0.33665635 -1.7478825  3.320454
## Ticker                     0.10154647 -0.01356575 -1.7902175  1.953404
## Analyst.Ratings            0.09234995  0.14353443 -1.1116901  1.625413
## Region                     0.00000000  0.00000000  0.0000000  0.000000
##                             normHits  decision
## EPS                       1.00000000 Confirmed
## P.E.Ratio                 1.00000000 Confirmed
## P.B.Ratio                 1.00000000 Confirmed
## Avg.Volume                1.00000000 Confirmed
## Market.Cap.Classification 1.00000000 Confirmed
## Net.Income                1.00000000 Confirmed
## Revenue                   1.00000000 Confirmed
## Dividend.Yield....        1.00000000 Confirmed
## Growth.vs.Value           1.00000000 Confirmed
## ROE                       1.00000000 Confirmed
## Free.Cash.Flow            1.00000000 Confirmed
## Debt.to.Equity.Ratio      0.68674699 Confirmed
## Beta                      0.72289157 Confirmed
## Sector                    0.06024096  Rejected
## Insider.Transactions      0.00000000  Rejected
## Volatility.Classification 0.02409639  Rejected
## Industry                  0.02409639  Rejected
## Ticker                    0.01204819  Rejected
## Analyst.Ratings           0.00000000  Rejected
## Region                    0.00000000  Rejected
quantitative_vars <- names(stock_data)[sapply(stock_data, is.numeric)]
top_quantitative_scores <- sorted_scores_stock[rownames(sorted_scores_stock) %in% quantitative_vars, ]
top_5_vars_stock <- rownames(top_quantitative_scores)[1:5]
print(top_5_vars_stock)
## [1] "EPS"        "P.E.Ratio"  "P.B.Ratio"  "Avg.Volume" "Net.Income"

I ran Boruta on the variables and identified the five quantitative variables best suited for my MLR model, taking the five with the highest mean importance (meanImp) from the Boruta results: "EPS", "P.E.Ratio", "P.B.Ratio", "Avg.Volume", and "Net.Income".
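Boruta's confirm/reject decisions come from comparing each predictor's random-forest importance to "shadow" copies made by permuting its values. The core idea in a minimal standalone sketch, using correlation in place of random-forest importance so it needs no packages:

```r
# Shadow-feature idea: a permuted copy of a predictor keeps its distribution
# but destroys any real relationship to the target; a genuine predictor
# must outscore its shadow.
set.seed(1)
x <- rnorm(200)
y <- 2 * x + rnorm(200)        # real signal
x_shadow <- sample(x)          # permutation destroys the relationship
cor(x, y)                      # strong
cor(x_shadow, y)               # near zero -- the bar a real predictor must beat
```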

Factor Variable Selection

# R automatically dummy-codes factor variables in model formulas (treatment coding with a dropped reference level), so that saves us a little code


# run the full EDA (boxplot, ANOVA, group summary) for each of our factor variables
helper.eda_factor(stock_data, "Sector", "Recent.Close.Price", "blue")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

##              Df  Sum Sq Mean Sq F value Pr(>F)  
## Sector       10  385350   38535   1.947 0.0377 *
## Residuals   415 8215020   19795                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## # A tibble: 11 × 4
##    Sector                  mean    sd     n
##    <chr>                  <dbl> <dbl> <int>
##  1 Financial Services     183.  148.     33
##  2 Industrials            146.  157.     85
##  3 Technology             136.  130.     64
##  4 Consumer Defensive     131.  220.     33
##  5 Healthcare             130.  102.     47
##  6 Consumer Cyclical      122.  121.     52
##  7 Basic Materials        122.  133.     25
##  8 Real Estate            101.  187.     28
##  9 Communication Services  81.7 109.     12
## 10 Utilities               76.3  40.6    25
## 11 Energy                  47.6  48.4    22
helper.eda_factor(stock_data, "Industry", "Recent.Close.Price", "green")

##              Df  Sum Sq Mean Sq F value Pr(>F)
## Industry    114 2493519   21873   1.114  0.234
## Residuals   311 6106851   19636
## # A tibble: 115 × 4
##    Industry                          mean    sd     n
##    <chr>                            <dbl> <dbl> <int>
##  1 Home Improvement Retail           429.   NA      1
##  2 Industrial Distribution           394.  544.     4
##  3 Healthcare Plans                  338.   NA      1
##  4 Financial Data & Stock Exchanges  327.  145.     6
##  5 Discount Stores                   324.  432.     4
##  6 Insurance Brokers                 312.   NA      1
##  7 Diagnostics & Research            275.  130.     4
##  8 REIT - Specialty                  273.  401.     5
##  9 Auto & Truck Dealerships          260.   NA      1
## 10 Capital Markets                   259.   NA      1
## # ℹ 105 more rows
helper.eda_factor(stock_data, "Market.Cap.Classification", "Recent.Close.Price", "purple")

##                            Df  Sum Sq Mean Sq F value Pr(>F)    
## Market.Cap.Classification   2 1435734  717867   42.38 <2e-16 ***
## Residuals                 423 7164636   16938                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## # A tibble: 3 × 4
##   Market.Cap.Classification  mean    sd     n
##   <chr>                     <dbl> <dbl> <int>
## 1 Large Cap                 186.  171.    187
## 2 Mid Cap                   100.  100.    159
## 3 Small Cap                  35.9  38.1    80
helper.eda_factor(stock_data, "Volatility.Classification", "Recent.Close.Price", "lightblue")

##                            Df  Sum Sq Mean Sq F value Pr(>F)  
## Volatility.Classification   2  153470   76735   3.843 0.0222 *
## Residuals                 423 8446900   19969                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## # A tibble: 3 × 4
##   Volatility.Classification  mean    sd     n
##   <chr>                     <dbl> <dbl> <int>
## 1 Low Volatility             146.  167.   104
## 2 Medium Volatility          138.  159.   148
## 3 High Volatility            104.  103.   174
helper.eda_factor(stock_data, "Growth.vs.Value", "Recent.Close.Price", "darkblue")

##                  Df  Sum Sq Mean Sq F value Pr(>F)  
## Growth.vs.Value   1  121359  121359   6.069 0.0142 *
## Residuals       424 8479010   19998                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## # A tibble: 2 × 4
##   Growth.vs.Value  mean    sd     n
##   <chr>           <dbl> <dbl> <int>
## 1 Growth          134.  152.    341
## 2 Value            92.2  88.4    85
helper.eda_factor(stock_data, "Analyst.Ratings", "Recent.Close.Price", "pink")

##                  Df  Sum Sq Mean Sq F value Pr(>F)
## Analyst.Ratings   3   43510   14503   0.715  0.543
## Residuals       422 8556860   20277
## # A tibble: 4 × 4
##   Analyst.Ratings  mean    sd     n
##   <chr>           <dbl> <dbl> <int>
## 1 sell            162.  214.     25
## 2 buy             125.  132.    324
## 3 strongBuy       123.  169.     64
## 4 strongSell       98.3  64.9    13
helper.eda_factor(stock_data, "Insider.Transactions", "Recent.Close.Price", "yellow")

##                       Df  Sum Sq Mean Sq F value Pr(>F)
## Insider.Transactions   2   24936   12468   0.615  0.541
## Residuals            423 8575434   20273
## # A tibble: 3 × 4
##   Insider.Transactions  mean    sd     n
##   <chr>                <dbl> <dbl> <int>
## 1 Negative              132.  135.   267
## 2 Positive              117.  156.   148
## 3 Neutral               108.  100.    11

After conducting EDA on all of our factor variables, there are only two I feel comfortable selecting: Market.Cap.Classification and Growth.vs.Value. I eliminated Industry first, as it has too many categories and would overfit the model despite its low ANOVA p-value. I then eliminated Analyst.Ratings and Insider.Transactions because their ANOVA p-values were not significant. Of the remaining variables, Sector and Volatility.Classification were both rejected by Boruta and had higher p-values than the chosen variables, so I rejected them as well. I was at first surprised that Sector failed Boruta, but then I realized that stock price is not an indication of growth rates in a sector; since the dataset has no growth variables, the strong performance of technology stocks this year relative to energy stocks was not leveraged the way I expected.
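The automatic factor handling noted in the code comments can be seen directly with model.matrix(), which shows the treatment (dummy) coding lm() uses. The two-level factor below mirrors Growth.vs.Value but is made up for illustration:

```r
# Treatment coding: one dummy column per non-reference level; the
# alphabetically first level ("Growth") is absorbed into the intercept.
f <- factor(c("Growth", "Value", "Growth"))
model.matrix(~ f)
colnames(model.matrix(~ f))    # "(Intercept)" "fValue"
```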

2. Descriptive Analysis

Analysis for EPS

helper.analyze_variable(stock_data, "EPS")

## 
## Summary Statistics for EPS :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.030   1.383   3.320   5.160   6.603  51.210
results <- detect_nonlinearities_and_outliers(
  data = stock_data, 
  dependent_var = "Recent.Close.Price", 
  predictor_var = "EPS"
)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

# Output results
print("Optimal Lambda from powerTransform")
## [1] "Optimal Lambda from powerTransform"
results$lambda  # Optimal lambda
##       Y1 
## 0.170388
results$model_summaries  # Summaries of the three models
## $base_model
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -425.60  -38.96  -20.55   23.90  756.17 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.6797     6.0882   6.517 2.03e-10 ***
## EPS          16.7384     0.7527  22.237  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96.77 on 424 degrees of freedom
## Multiple R-squared:  0.5384, Adjusted R-squared:  0.5373 
## F-statistic: 494.5 on 1 and 424 DF,  p-value: < 2.2e-16
## 
## 
## $transformed_model
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", transformed_var)), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -211.58  -63.26  -18.84   30.21  832.17 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       40.321      7.006   5.755 1.66e-08 ***
## EPS_transformed   66.792      3.679  18.156  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 106.8 on 424 degrees of freedom
## Multiple R-squared:  0.4374, Adjusted R-squared:  0.4361 
## F-statistic: 329.6 on 1 and 424 DF,  p-value: < 2.2e-16
## 
## 
## $model_without_outliers
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)), 
##     data = data_without_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -208.17  -35.08  -16.07   23.68  373.28 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.0134     4.7356    6.76  4.7e-11 ***
## EPS          18.3237     0.6903   26.54  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71.24 on 414 degrees of freedom
## Multiple R-squared:  0.6299, Adjusted R-squared:  0.629 
## F-statistic: 704.6 on 1 and 414 DF,  p-value: < 2.2e-16
results$influential_outliers # Name of the outliers
##     Ticker                 Sector                        Industry        Region
## 95      CB     Financial Services Insurance - Property & Casualty North America
## 98     GWW            Industrials         Industrial Distribution North America
## 111    THC             Healthcare         Medical Care Facilities North America
## 130   CHTR Communication Services                Telecom Services North America
## 151   EQIX            Real Estate                REIT - Specialty North America
## 199    NEU        Basic Materials             Specialty Chemicals North America
## 268   COST     Consumer Defensive                 Discount Stores North America
## 451   CINF     Financial Services Insurance - Property & Casualty North America
## 486    HOV      Consumer Cyclical        Residential Construction North America
## 635    MTH      Consumer Cyclical        Residential Construction North America
##     Market.Cap.Classification Volatility.Classification Growth.vs.Value
## 95                  Large Cap            Low Volatility           Value
## 98                  Large Cap         Medium Volatility          Growth
## 111                 Large Cap           High Volatility           Value
## 130                 Large Cap         Medium Volatility           Value
## 151                 Large Cap            Low Volatility          Growth
## 199                   Mid Cap            Low Volatility           Value
## 268                 Large Cap            Low Volatility          Growth
## 451                 Large Cap            Low Volatility           Value
## 486                 Small Cap           High Volatility           Value
## 635                   Mid Cap           High Volatility           Value
##     P.E.Ratio Dividend.Yield....  Beta Avg.Volume Recent.Close.Price   EPS
## 95  11.833198               1.26 0.681 1636872.91             288.73 24.40
## 98  32.682755               0.68 1.151  233209.16            1205.34 36.88
## 111  4.518049               0.00 2.150 1210783.27             142.68 31.58
## 130 12.436247               0.00 1.036 1351296.02             396.97 31.92
## 151 88.501350               1.74 0.701  525569.72             981.48 11.09
## 199 11.848990               1.87 0.512   36585.66             533.56 45.03
## 268 58.830505               0.48 0.789 1990399.60             971.88 16.52
## 451  8.209040               2.03 0.666  689332.27             159.83 19.47
## 486  6.130652               0.00 2.607   82824.70             196.61 32.07
## 635  8.641791               1.57 1.822  411294.02             191.07 22.11
##          Revenue Net.Income Debt.to.Equity.Ratio     ROE P.B.Ratio
## 95   54697000960 9996999680               31.706 0.16125  1.769667
## 98   16931999744 1828999936               83.325 0.52611 16.759687
## 111  20971999232 3126000128              159.393 0.59565  3.538427
## 130  54869999616 4674999808              533.472 0.32965  4.003601
## 151   8401490944 1056177984              140.964 0.08267  6.969451
## 199   2775260928  430552992               84.902 0.36990  3.752523
## 268 254453006336 7367000064               42.118 0.30267 18.231411
## 451  12154999808 3070000128                6.331 0.25135  1.809731
## 486   2912312064  221342000              167.111 0.41306  2.093845
## 635   6433726976  812387968               27.268 0.17192  1.374377
##     Free.Cash.Flow Analyst.Ratings Insider.Transactions EPS_transformed
## 95     13009625088             buy             Negative        4.245802
## 98      1546249984       strongBuy             Positive        4.983382
## 111     3382625024             buy             Negative        4.700260
## 130     3804124928             buy             Negative        4.719562
## 151     2946253568             buy             Negative        2.974147
## 199      368575488            sell             Negative        5.358926
## 268     4467500032             buy             Negative        3.595490
## 451     3040124928             buy             Positive        3.864194
## 486     -103003128            sell             Positive        4.728024
## 635     -350566624             buy             Negative        4.077370

In my opinion, for EPS there is no reason to do a powerTransform: the R^2 actually decreases after the transformation, and although it definitely makes the variable more normally distributed, I do not think that gain matters for the purpose of building a multiple linear regression model. The next question concerns outliers. Using Cook's Distance, we identified 10 outliers whose Cook's Distance lies more than 4 standard deviations above the mean Cook's Distance. I do not think it is worth removing them, however, as the R^2 does not change drastically with their exclusion.
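
The Cook's Distance cutoff described above can be sketched in a few lines; this is a minimal illustration on the built-in `cars` dataset (a stand-in, since the actual rule lives inside `detect_nonlinearities_and_outliers`, and the exact cutoff form is an assumption):

```r
# Flag influential points whose Cook's Distance lies more than 4 standard
# deviations above the mean Cook's Distance (assumed form of the cutoff).
fit <- lm(dist ~ speed, data = cars)  # stand-in for the stock_data fit
cd <- cooks.distance(fit)
influential <- which(cd > mean(cd) + 4 * sd(cd))
cars[influential, ]  # rows that would be dropped in the "without outliers" refit
```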

Analysis for PB Ratio

helper.analyze_variable(stock_data, "P.B.Ratio")

## 
## Summary Statistics for P.B.Ratio :
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.04988   1.78783   2.97480   5.59541   5.47365 138.84380
results <- detect_nonlinearities_and_outliers(
  data = stock_data, 
  dependent_var = "Recent.Close.Price", 
  predictor_var = "P.B.Ratio"
)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

# Output results
print("Optimal Lambda from powerTransform")
## [1] "Optimal Lambda from powerTransform"
results$lambda  # Optimal lambda
##         Y1 
## -0.1097533
results$model_summaries  # Summaries of the three models
## $base_model
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -333.31  -78.39  -36.82   26.80 1037.27 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 104.9936     7.5509   13.90  < 2e-16 ***
## P.B.Ratio     3.7634     0.6433    5.85 9.83e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 137 on 424 degrees of freedom
## Multiple R-squared:  0.07469,    Adjusted R-squared:  0.07251 
## F-statistic: 34.22 on 1 and 424 DF,  p-value: 9.833e-09
## 
## 
## $transformed_model
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", transformed_var)), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -225.79  -65.24  -26.82   25.92  978.93 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             45.450     10.584   4.294 2.17e-05 ***
## P.B.Ratio_transformed   74.633      7.895   9.454  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 129.4 on 424 degrees of freedom
## Multiple R-squared:  0.1741, Adjusted R-squared:  0.1721 
## F-statistic: 89.37 on 1 and 424 DF,  p-value: < 2.2e-16
## 
## 
## $model_without_outliers
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)), 
##     data = data_without_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -221.12  -66.45  -28.02   29.01  428.79 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  87.1841     6.2032   14.05  < 2e-16 ***
## P.B.Ratio     5.9394     0.7071    8.40 7.04e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103.9 on 418 degrees of freedom
## Multiple R-squared:  0.1444, Adjusted R-squared:  0.1424 
## F-statistic: 70.56 on 1 and 418 DF,  p-value: 7.036e-16
results$influential_outliers # Name of the outliers
##     Ticker             Sector                      Industry        Region
## 98     GWW        Industrials       Industrial Distribution North America
## 151   EQIX        Real Estate              REIT - Specialty North America
## 268   COST Consumer Defensive               Discount Stores North America
## 494   VRSK        Industrials           Consulting Services North America
## 622    GHC Consumer Defensive Education & Training Services North America
## 643   FTNT         Technology     Software - Infrastructure North America
##     Market.Cap.Classification Volatility.Classification Growth.vs.Value
## 98                  Large Cap         Medium Volatility          Growth
## 151                 Large Cap            Low Volatility          Growth
## 268                 Large Cap            Low Volatility          Growth
## 494                 Large Cap         Medium Volatility          Growth
## 622                   Mid Cap         Medium Volatility          Growth
## 643                 Large Cap         Medium Volatility          Growth
##     P.E.Ratio Dividend.Yield....  Beta Avg.Volume Recent.Close.Price   EPS
## 98   32.68276               0.68 1.151  233209.16            1205.34 36.88
## 151  88.50135               1.74 0.701  525569.72             981.48 11.09
## 268  58.83051               0.48 0.789 1990399.60             971.88 16.52
## 494  45.47295               0.53 0.862  794768.92             294.21  6.47
## 622  18.18239               0.74 1.114   15381.27             931.12 51.21
## 643  47.76382               0.00 0.995 5533398.01              95.05  1.99
##          Revenue Net.Income Debt.to.Equity.Ratio     ROE  P.B.Ratio
## 98   16931999744 1828999936               83.325 0.52611  16.759687
## 151   8401490944 1056177984              140.964 0.08267   6.969451
## 268 254453006336 7367000064               42.118 0.30267  18.231411
## 494   2823300096  930300032             1069.413 2.65609 138.843800
## 622   4711917056  227524992               32.787 0.05939   1.007689
## 643   5710799872 1529900032              118.588 3.11525  80.143340
##     Free.Cash.Flow Analyst.Ratings Insider.Transactions P.B.Ratio_transformed
## 98      1546249984       strongBuy             Positive            2.42459169
## 151     2946253568             buy             Negative            1.74861823
## 268     4467500032             buy             Negative            2.48607870
## 494      763787520             buy             Negative            3.80942452
## 622      213084880            sell             Positive            0.00765657
## 643     1648400000             buy             Negative            3.47980961
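
The lambda estimate and the `P.B.Ratio_transformed` column above come from the Box-Cox machinery in the `car` package; a minimal sketch on synthetic right-skewed data (assuming `car` is installed):

```r
library(car)                       # powerTransform(), bcPower()
set.seed(1)
x <- rexp(200, rate = 0.3)         # right-skewed stand-in for P.B.Ratio
lambda <- coef(powerTransform(x))  # maximum-likelihood Box-Cox lambda
x_t <- bcPower(x, lambda)          # transformed predictor
# bcPower maps values below 1 to negative numbers, which is why some
# transformed P/B values are negative even though a raw P/B cannot be.
```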

In my opinion, for the P/B ratio there is a reason to do a powerTransform: the R^2 increases by a significant amount after the transformation, and it definitely makes the variable more normally distributed. I would argue this transformation helps make our model more predictive. The one question I had is that the transform produces a negative P/B ratio in some instances, which is economically infeasible; but because this is a transformed variable, I do not think that is an issue, since every value is scaled consistently by the transformation. The next question concerns outliers. Using Cook's Distance, we identified 6 outliers whose Cook's Distance lies more than 4 standard deviations above the mean Cook's Distance. I do not think it is worth removing them, however, as the R^2 does not change drastically with their exclusion and is actually lower than that of the transformed linear model.

Analysis for PE Ratio

helper.analyze_variable(stock_data, "P.E.Ratio")

## 
## Summary Statistics for P.E.Ratio :
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.129   16.942   25.877   42.634   38.645 1029.182
results <- detect_nonlinearities_and_outliers(
  data = stock_data, 
  dependent_var = "Recent.Close.Price", 
  predictor_var = "P.E.Ratio"
)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

# Output results
print("Optimal Lambda from powerTransform")
## [1] "Optimal Lambda from powerTransform"
results$lambda  # Optimal lambda
##       Y1 
## -0.15614
results$model_summaries  # Summaries of the three models
## $base_model
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -126.29  -85.98  -42.26   29.92 1079.06 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 127.02916    7.84473  16.193   <2e-16 ***
## P.E.Ratio    -0.02293    0.08755  -0.262    0.794    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 142.4 on 424 degrees of freedom
## Multiple R-squared:  0.0001618,  Adjusted R-squared:  -0.002196 
## F-statistic: 0.0686 on 1 and 424 DF,  p-value: 0.7935
## 
## 
## $transformed_model
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", transformed_var)), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -164.39  -80.06  -40.77   31.01 1074.16 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)              29.36      37.71   0.778  0.43674   
## P.E.Ratio_transformed    37.87      14.52   2.607  0.00945 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.3 on 424 degrees of freedom
## Multiple R-squared:  0.01578,    Adjusted R-squared:  0.01346 
## F-statistic: 6.797 on 1 and 424 DF,  p-value: 0.009452
## 
## 
## $model_without_outliers
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)), 
##     data = data_without_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103.92  -64.72  -25.04   35.06  324.31 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 104.61794    5.09740  20.524   <2e-16 ***
## P.E.Ratio     0.00686    0.06900   0.099    0.921    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.69 on 403 degrees of freedom
## Multiple R-squared:  2.452e-05,  Adjusted R-squared:  -0.002457 
## F-statistic: 0.009883 on 1 and 403 DF,  p-value: 0.9209
results$influential_outliers # Name of the outliers
##     Ticker             Sector                            Industry        Region
## 15     MCO Financial Services    Financial Data & Stock Exchanges North America
## 32    MUSA  Consumer Cyclical                    Specialty Retail North America
## 98     GWW        Industrials             Industrial Distribution North America
## 109    PEN         Healthcare                     Medical Devices North America
## 151   EQIX        Real Estate                    REIT - Specialty North America
## 153   VICR         Technology               Electronic Components North America
## 188    FDS Financial Services    Financial Data & Stock Exchanges North America
## 199    NEU    Basic Materials                 Specialty Chemicals North America
## 238   INCY         Healthcare                       Biotechnology North America
## 268   COST Consumer Defensive                     Discount Stores North America
## 296     ZI         Technology              Software - Application North America
## 300    AMP Financial Services                    Asset Management North America
## 393   MPWR         Technology                      Semiconductors North America
## 395     DE        Industrials Farm & Heavy Construction Machinery North America
## 431   CVCO  Consumer Cyclical            Residential Construction North America
## 515    LIN    Basic Materials                 Specialty Chemicals North America
## 528    LMT        Industrials                 Aerospace & Defense North America
## 544   ADBE         Technology           Software - Infrastructure North America
## 576    NOC        Industrials                 Aerospace & Defense North America
## 622    GHC Consumer Defensive       Education & Training Services North America
## 644   SNPS         Technology           Software - Infrastructure North America
##     Market.Cap.Classification Volatility.Classification Growth.vs.Value
## 15                  Large Cap           High Volatility          Growth
## 32                  Large Cap            Low Volatility          Growth
## 98                  Large Cap         Medium Volatility          Growth
## 109                   Mid Cap            Low Volatility          Growth
## 151                 Large Cap            Low Volatility          Growth
## 153                   Mid Cap           High Volatility          Growth
## 188                 Large Cap            Low Volatility          Growth
## 199                   Mid Cap            Low Volatility           Value
## 238                 Large Cap            Low Volatility          Growth
## 268                 Large Cap            Low Volatility          Growth
## 296                   Mid Cap         Medium Volatility          Growth
## 300                 Large Cap           High Volatility          Growth
## 393                 Large Cap         Medium Volatility          Growth
## 395                 Large Cap         Medium Volatility          Growth
## 431                   Mid Cap           High Volatility          Growth
## 515                 Large Cap         Medium Volatility          Growth
## 528                 Large Cap            Low Volatility          Growth
## 544                 Large Cap           High Volatility          Growth
## 576                 Large Cap            Low Volatility          Growth
## 622                   Mid Cap         Medium Volatility          Growth
## 644                 Large Cap         Medium Volatility          Growth
##     P.E.Ratio Dividend.Yield....  Beta Avg.Volume Recent.Close.Price   EPS
## 15   45.66027               0.68 1.289  739825.90             499.98 10.95
## 32   22.62701               0.35 0.756  195428.69             547.80 24.21
## 98   32.68276               0.68 1.151  233209.16            1205.34 36.88
## 109 277.40910               0.00 0.514  374408.76             244.12  0.88
## 151  88.50135               1.74 0.701  525569.72             981.48 11.09
## 153 532.10000               0.00 1.487  231219.92              53.21  0.10
## 188  35.32541               0.85 0.752  265147.01             490.67 13.89
## 199  11.84899               1.87 0.512   36585.66             533.56 45.03
## 238 828.77770               0.00 0.710 2351324.30              74.59  0.09
## 268  58.83051               0.48 0.789 1990399.60             971.88 16.52
## 296 364.66666               0.00 1.037 7082583.67              10.94  0.03
## 300  21.93236               1.03 1.334  473387.65             573.97 26.17
## 393  64.06773               0.88 1.148  601479.68             567.64  8.86
## 395  18.18501               1.45 0.935 1489247.81             465.90 25.62
## 431  29.06780               0.00 1.247   65681.67             514.50 17.70
## 515  34.97648               1.21 0.949 1864140.24             460.99 13.18
## 528  19.15376               2.49 0.481 1077877.29             529.41 27.64
## 544  43.68586               0.00 1.299 3166801.59             515.93 11.81
## 576  30.20666               1.68 0.346  844117.93             489.65 16.21
## 622  18.18239               0.74 1.114   15381.27             931.12 51.21
## 644  57.45782               0.00 1.076 1052858.96             558.49  9.72
##          Revenue Net.Income Debt.to.Equity.Ratio     ROE P.B.Ratio
## 15    6896000000 2003000064              197.861 0.54016 23.204159
## 32   18275201024  510000000              282.568 0.60925 13.362931
## 98   16931999744 1828999936               83.325 0.52611 16.759687
## 109   1163776000   34547000               20.528 0.03129  8.480805
## 151   8401490944 1056177984              140.964 0.08267  6.969451
## 153    355544000    4551000                1.324 0.00842  4.329889
## 188   2203056128  537126016               82.338 0.30411  9.737255
## 199   2775260928  430552992               84.902 0.36990  3.752523
## 238   4075859968   32482000                1.151 0.00802  4.534898
## 268 254453006336 7367000064               42.118 0.30267 18.231411
## 296   1221600000    9000000               82.474 0.00459  2.248715
## 300  17444999168 2707000064               67.402 0.56751  9.764050
## 393   2039447040  434241984                0.693 0.20214 11.774077
## 395  55955001344 8224000000              288.043 0.35481  5.527280
## 431   1851947008  148164992                3.871 0.14376  4.028690
## 515  33024999424 6383000064               54.828 0.16190  5.603447
## 528  71295000576 6674999808              268.347 0.81037 17.353153
## 544  20946999296 5360000000               41.788 0.35355 15.784916
## 576  40985001984 2375000064              122.722 0.15484  4.841165
## 622   4711917056  227524992               32.787 0.05939  1.007689
## 644   6483438080 1511323008                8.700 0.21819 11.128180
##     Free.Cash.Flow Analyst.Ratings Insider.Transactions P.E.Ratio_transformed
## 15      1952999936             buy             Negative              2.877834
## 32       339112512             buy             Positive              2.469241
## 98      1546249984       strongBuy             Positive              2.688814
## 109      125270624             buy             Negative              3.743680
## 151     2946253568             buy             Negative              3.224055
## 153       26165500             buy             Positive              4.000981
## 188      555261760            sell             Negative              2.733652
## 199      368575488            sell             Negative              2.050982
## 238      482474624       strongBuy             Negative              4.161656
## 268     4467500032             buy             Negative              3.014662
## 296      431324992             buy             Negative              3.854913
## 300     4073125120       strongBuy             Negative              2.450035
## 393      516699488             buy             Negative              3.059501
## 395     3000000000             buy             Positive              2.332637
## 431       87199872             buy             Negative              2.620182
## 515     4216750080             buy             Positive              2.727958
## 528     5304750080             buy             Negative              2.365501
## 544     6618500096             buy             Negative              2.853408
## 576     2600624896             buy             Positive              2.642823
## 622      213084880            sell             Positive              2.332545
## 644     -144225504             buy             Positive              3.002143

In my opinion, for the P/E ratio there is a reason to do a powerTransform. In both the base model and the model without outliers, the linear fit does not find P.E.Ratio statistically significant, but the transformed model does: its slope p-value is below 0.05. This is why, from now on, I will mostly proceed with the transformed version of P/E. The next question concerns outliers. Using Cook's Distance, we identified 21 outliers whose Cook's Distance lies more than 4 standard deviations above the mean Cook's Distance. I do not think it is worth removing them, however: the R^2 does not change drastically with their exclusion and is actually lower than that of the transformed linear model. Additionally, removing them would mean discarding around 5% of our data, which is a lot.
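
The significance comparison above comes down to reading the slope p-value out of each `summary.lm`; a sketch on synthetic data where, as with P/E, the signal only appears after transformation (all names here are illustrative):

```r
set.seed(3)
x <- rexp(300, rate = 0.05)              # skewed stand-in for P.E.Ratio
y <- 30 * log(x) + rnorm(300, sd = 20)   # response depends on log(x), not x
base  <- lm(y ~ x)
trans <- lm(y ~ log(x))
summary(base)$coefficients[2, 4]   # slope p-value, base model
summary(trans)$coefficients[2, 4]  # slope p-value, transformed model
summary(trans)$r.squared           # R^2 to compare against the base fit
```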

Analysis for Avg Volume

helper.analyze_variable(stock_data, "Avg.Volume")

## 
## Summary Statistics for Avg.Volume :
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    12418   314911   765520  2265197  1734321 96770663
data = stock_data
dependent_var = "Recent.Close.Price"
predictor_var = "Avg.Volume"
results <- detect_nonlinearities_and_outliers(
  data = stock_data, 
  dependent_var = "Recent.Close.Price", 
  predictor_var = "Avg.Volume"
)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

# Output results
print("Optimal Lambda from powerTransform")
## [1] "Optimal Lambda from powerTransform"
results$lambda  # Optimal lambda
##          Y1 
## -0.01638639
results$model_summaries  # Summaries of the three models
## $base_model
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.50  -85.52  -42.69   29.98 1079.13 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.262e+02  7.281e+00  17.336   <2e-16 ***
## Avg.Volume  -7.687e-08  1.026e-06  -0.075     0.94    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 142.4 on 424 degrees of freedom
## Multiple R-squared:  1.323e-05,  Adjusted R-squared:  -0.002345 
## F-statistic: 0.005612 on 1 and 424 DF,  p-value: 0.9403
## 
## 
## $transformed_model
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", transformed_var)), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -142.23  -82.78  -41.33   29.30 1070.32 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)             240.215     75.256   3.192  0.00152 **
## Avg.Volume_transformed   -9.402      6.172  -1.523  0.12841   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 142 on 424 degrees of freedom
## Multiple R-squared:  0.005444,   Adjusted R-squared:  0.003098 
## F-statistic: 2.321 on 1 and 424 DF,  p-value: 0.1284
## 
## 
## $model_without_outliers
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)), 
##     data = data_without_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -119.43  -77.32  -36.00   35.77  454.30 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.205e+02  5.896e+00  20.431   <2e-16 ***
## Avg.Volume  -1.676e-06  1.101e-06  -1.522    0.129    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 111.7 on 419 degrees of freedom
## Multiple R-squared:  0.005501,   Adjusted R-squared:  0.003128 
## F-statistic: 2.318 on 1 and 419 DF,  p-value: 0.1287
results$influential_outliers # Name of the outliers
##     Ticker             Sector                      Industry        Region
## 98     GWW        Industrials       Industrial Distribution North America
## 151   EQIX        Real Estate              REIT - Specialty North America
## 268   COST Consumer Defensive               Discount Stores North America
## 361   TSLA  Consumer Cyclical            Auto Manufacturers North America
## 622    GHC Consumer Defensive Education & Training Services North America
##     Market.Cap.Classification Volatility.Classification Growth.vs.Value
## 98                  Large Cap         Medium Volatility          Growth
## 151                 Large Cap            Low Volatility          Growth
## 268                 Large Cap            Low Volatility          Growth
## 361                 Large Cap           High Volatility          Growth
## 622                   Mid Cap         Medium Volatility          Growth
##     P.E.Ratio Dividend.Yield....  Beta  Avg.Volume Recent.Close.Price   EPS
## 98   32.68276               0.68 1.151   233209.16            1205.34 36.88
## 151  88.50135               1.74 0.701   525569.72             981.48 11.09
## 268  58.83051               0.48 0.789  1990399.60             971.88 16.52
## 361  94.30601               0.00 2.295 96770662.55             345.16  3.66
## 622  18.18239               0.74 1.114    15381.27             931.12 51.21
##          Revenue  Net.Income Debt.to.Equity.Ratio     ROE P.B.Ratio
## 98   16931999744  1828999936               83.325 0.52611 16.759687
## 151   8401490944  1056177984              140.964 0.08267  6.969451
## 268 254453006336  7367000064               42.118 0.30267 18.231411
## 361  97150001152 12743000064               18.078 0.20389 15.828671
## 622   4711917056   227524992               32.787 0.05939  1.007689
##     Free.Cash.Flow Analyst.Ratings Insider.Transactions Avg.Volume_transformed
## 98      1546249984       strongBuy             Positive               11.18847
## 151     2946253568             buy             Negative               11.84765
## 268     4467500032             buy             Negative               12.90911
## 361      676625024             buy             Negative               15.87610
## 622      213084880            sell             Positive                8.91794
# A log transformation is also a valid transform for this variable, so we test it as well
data <- stock_data
log_transformed_var <- paste0(predictor_var, "_Log") # build the new column name dynamically
data[[log_transformed_var]] <- log(stock_data[[predictor_var]])
model4 <- lm(as.formula(paste(dependent_var, "~", log_transformed_var)), data = data)
par(mfrow = c(2, 2))
plot(model4, main = "Model with Log Transform")

summary(model4)
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", log_transformed_var)), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -141.84  -82.41  -41.31   29.25 1070.25 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     228.587     67.316   3.396 0.000749 ***
## Avg.Volume_Log   -7.565      4.940  -1.531 0.126460    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 142 on 424 degrees of freedom
## Multiple R-squared:  0.005499,   Adjusted R-squared:  0.003154 
## F-statistic: 2.345 on 1 and 424 DF,  p-value: 0.1265
p4 <- ggplot(data, aes_string(x = log_transformed_var, y = dependent_var)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", color = "blue", se = TRUE) +
    theme_minimal() +
    labs(title = "Log Model",
         x = predictor_var,
         y = dependent_var)
print(p4)
## `geom_smooth()` using formula = 'y ~ x'

This variable will likely not be used in future analysis, as none of the linear models (base, transformed, or outlier-removed) shows statistical significance for it. So although Avg Volume passed Boruta, it does not show much promise. I will continue to include the base variable when I try different models, but as of now it looks like I will not be using it going forward. It is important to note that a log transform is also reasonable here alongside the optimal transform (the estimated lambda of -0.016 is close to 0), so I tested both; they give nearly identical fits.
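
The optimal lambda here (-0.016) is essentially zero, and at lambda = 0 the Box-Cox transform is exactly the natural log, which is why the two transforms produce nearly identical fits; a quick check (assuming the `car` package):

```r
library(car)                       # bcPower()
v <- c(12418, 765520, 96770663)    # a few Avg.Volume-scale values
all.equal(bcPower(v, 0), log(v))   # TRUE: lambda = 0 is the log transform
```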

Analysis for Net Income

helper.analyze_variable(stock_data, "Net.Income")

## 
## Summary Statistics for Net.Income :
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.302e+06 9.838e+07 3.065e+08 1.070e+09 9.530e+08 3.370e+10
data = stock_data 
dependent_var = "Recent.Close.Price"
predictor_var = "Net.Income"
results <- detect_nonlinearities_and_outliers(
  data = stock_data, 
  dependent_var = "Recent.Close.Price", 
  predictor_var = "Net.Income"
)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

# Output results
print("Optimal Lambda from powerTransform")
## [1] "Optimal Lambda from powerTransform"
results$lambda  # Optimal lambda
##         Y1 
## 0.02544285
results$model_summaries  # Summaries of the three models
## $base_model
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -437.13  -80.70  -37.90   27.15 1069.30 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.120e+02  7.276e+00  15.392  < 2e-16 ***
## Net.Income  1.315e-08  2.638e-09   4.985 9.06e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 138.4 on 424 degrees of freedom
## Multiple R-squared:  0.05536,    Adjusted R-squared:  0.05313 
## F-statistic: 24.85 on 1 and 424 DF,  p-value: 9.058e-07
## 
## 
## $transformed_model
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", transformed_var)), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -213.17  -65.02  -22.67   29.32 1014.16 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -421.626     57.721  -7.305 1.39e-12 ***
## Net.Income_transformed   21.639      2.267   9.545  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 129.2 on 424 degrees of freedom
## Multiple R-squared:  0.1769, Adjusted R-squared:  0.1749 
## F-statistic:  91.1 on 1 and 424 DF,  p-value: < 2.2e-16
## 
## 
## $model_without_outliers
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", predictor_var)), 
##     data = data_without_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -280.47  -73.10  -31.51   27.10  857.44 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.015e+02  6.565e+00  15.461  < 2e-16 ***
## Net.Income  2.135e-08  3.173e-09   6.726 5.71e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 120.1 on 420 degrees of freedom
## Multiple R-squared:  0.09724,    Adjusted R-squared:  0.0951 
## F-statistic: 45.24 on 1 and 420 DF,  p-value: 5.711e-11
results$influential_outliers # Name of the outliers
##     Ticker             Sector                Industry        Region
## 98     GWW        Industrials Industrial Distribution North America
## 174    XOM             Energy    Oil & Gas Integrated North America
## 268   COST Consumer Defensive         Discount Stores North America
## 440    WMT Consumer Defensive         Discount Stores North America
##     Market.Cap.Classification Volatility.Classification Growth.vs.Value
## 98                  Large Cap         Medium Volatility          Growth
## 174                 Large Cap         Medium Volatility           Value
## 268                 Large Cap            Low Volatility          Growth
## 440                 Large Cap            Low Volatility          Growth
##     P.E.Ratio Dividend.Yield....  Beta Avg.Volume Recent.Close.Price   EPS
## 98   32.68276               0.68 1.151   233209.2            1205.34 36.88
## 174  14.68991               3.36 0.880 16762449.8             117.96  8.03
## 268  58.83051               0.48 0.789  1990399.6             971.88 16.52
## 440  38.22314               0.99 0.516 17702963.3              92.50  2.42
##         Revenue Net.Income Debt.to.Equity.Ratio     ROE P.B.Ratio
## 98  1.69320e+10 1.8290e+09               83.325 0.52611 16.759687
## 174 3.43818e+11 3.3700e+10               15.394 0.14514  1.930226
## 268 2.54453e+11 7.3670e+09               42.118 0.30267 18.231411
## 440 6.65035e+11 1.5552e+10               69.566 0.18532  8.883981
##     Free.Cash.Flow Analyst.Ratings Insider.Transactions Net.Income_transformed
## 98      1546249984       strongBuy             Positive               28.31875
## 174    28811499520             buy             Negative               33.52234
## 268     4467500032             buy             Negative               30.75882
## 440     7701374976             buy             Negative               32.10348
## With this variable a log transformation is also valid, so we will test it as well
data <- stock_data
log_transformed_var <- paste0(predictor_var, "_Log") # build the log column name dynamically
data[[log_transformed_var]] <- log(stock_data[[predictor_var]])
model4 <- lm(as.formula(paste(dependent_var, "~", log_transformed_var)), data = data)
par(mfrow = c(2, 2))
plot(model4, main = "Model with Log Transform")

summary(model4)
## 
## Call:
## lm(formula = as.formula(paste(dependent_var, "~", log_transformed_var)), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -210.61  -65.84  -23.33   28.96 1014.65 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -564.573     72.829  -7.752 6.77e-14 ***
## Net.Income_Log   35.414      3.721   9.518  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 129.3 on 424 degrees of freedom
## Multiple R-squared:  0.176,  Adjusted R-squared:  0.1741 
## F-statistic: 90.59 on 1 and 424 DF,  p-value: < 2.2e-16
p4 <- ggplot(data, aes_string(x = log_transformed_var, y = dependent_var)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", color = "blue", se = TRUE) +
    theme_minimal() +
    labs(title = "Log Model",
         x = log_transformed_var,
         y = dependent_var)
print(p4)
## `geom_smooth()` using formula = 'y ~ x'

stock_data$Net.Income_log = log(stock_data$Net.Income)
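The log choice is consistent with the Box-Cox result above: the estimated lambda of about 0.025 is effectively zero, and the Box-Cox family reduces to the natural log as lambda approaches zero. A minimal sketch (base R only, using a handful of values in roughly the Net.Income range from the summary statistics) showing the two transforms are nearly interchangeable:

```r
# Box-Cox family: (y^lambda - 1) / lambda, with the limiting case
# lambda -> 0 equal to log(y). With lambda ~ 0.025 the two transforms
# are almost perfectly linearly related, so the fitted models agree.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

y  <- c(1.3e6, 9.8e7, 3.1e8, 1.1e9, 3.4e10)  # roughly the Net.Income range
bc <- box_cox(y, 0.02544285)                 # lambda from powerTransform above
lg <- log(y)

cor(bc, lg)  # essentially 1
```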

For Net.Income I will use a log transform, as it gives the highest R-squared and is a transformation I am comfortable interpreting. Additionally, it does not cause a loss of economic meaning, since all Net.Income values here are positive and the log is well-defined. As for outliers, I do not think it is worth removing the four influential points, because doing so does not produce a statistically significant difference in R-squared between the base model and the outlier-removed model.

## Descriptive Analysis of Factor Variables

Please refer to section 1 for boxplots and ANOVA analysis of the chosen factor variables. Density plots do not exist for factor variables, and linearity and outliers are likewise not a concern for them.

## Correlation Analysis

factor_vars <- c("Market.Cap.Classification", "Growth.vs.Value")
selected_vars <- c(factor_vars, top_5_vars_stock)

correlation_data_set <- stock_data[selected_vars]
correlation_data_set$Market.Cap.Classification <- as.factor(correlation_data_set$Market.Cap.Classification)
correlation_data_set$Growth.vs.Value <- as.factor(correlation_data_set$Growth.vs.Value)
correlation_test <- data.frame(lapply(correlation_data_set, function(x) {
  if (is.factor(x)) as.numeric(x) else x
}))
cor_matrix <- cor(correlation_test, use = "complete.obs")

strong_correlations <- which(abs(cor_matrix) > 0.8 & abs(cor_matrix) < 1, arr.ind = TRUE)
strong_pairs <- data.frame(
  Var1 = rownames(cor_matrix)[strong_correlations[, 1]],
  Var2 = colnames(cor_matrix)[strong_correlations[, 2]],
  Correlation = cor_matrix[strong_correlations]
)
print(strong_pairs)
## [1] Var1        Var2        Correlation
## <0 rows> (or 0-length row.names)
# Plot the correlation matrix; it is hard to read with this many variables, but still somewhat useful to see
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)
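As a bridge between the two diagnostics: for a single pair of predictors with correlation r, each predictor's VIF is 1/(1 - r^2), so the |r| > 0.8 screen above corresponds to a VIF cutoff of roughly 2.78. A small simulated check (hypothetical data, not the stock dataset):

```r
# For two predictors correlated at r, VIF = 1 / (1 - r^2); the |r| > 0.8
# screen above is therefore equivalent to flagging VIF > 1 / (1 - 0.64).
r <- 0.8
1 / (1 - r^2)  # ~2.78

# Simulate x2 correlated with x1 at roughly r = 0.8 and recover the VIF
# from the auxiliary regression x1 ~ x2.
set.seed(1)
n  <- 5000
x1 <- rnorm(n)
x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)

aux_r2 <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - aux_r2)
round(vif_x1, 2)  # close to 2.78
```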

There appear to be no variable pairs with high correlation, which means we can proceed with a linear model using all of our variables without worrying about multicollinearity. I prefer a correlation matrix over VIF because I am more comfortable with the math behind it and have used it more often; I will still use VIF in step 3.

# 3

## Initial Model Building

# build the model formula once so it can be reused across models
formula <- as.formula(paste("Recent.Close.Price ~", paste(selected_vars, collapse = " + ")))
# keep only the response and the chosen predictors
model_data <- stock_data[, c("Recent.Close.Price", selected_vars)]
model_data$Net.Income_log <- log(model_data$Net.Income)
# coerce to factors so lm() handles one-hot (dummy) encoding for free
model_data$Market.Cap.Classification <- as.factor(model_data$Market.Cap.Classification)
model_data$Growth.vs.Value <- as.factor(model_data$Growth.vs.Value)

initial_model <- lm(formula, data = model_data)
par(mfrow = c(2, 2))  # Reset plotting area
plot(initial_model, main = "Initial Model")

summary(initial_model)
## 
## Call:
## lm(formula = formula, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -384.95  -32.71  -12.58   23.86  703.95 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         5.813e+01  9.277e+00   6.266 9.22e-10 ***
## Market.Cap.ClassificationMid Cap   -3.263e+01  9.585e+00  -3.404 0.000728 ***
## Market.Cap.ClassificationSmall Cap -4.458e+01  1.217e+01  -3.662 0.000282 ***
## Growth.vs.ValueValue               -9.309e+01  1.066e+01  -8.730  < 2e-16 ***
## EPS                                 1.752e+01  7.130e-01  24.569  < 2e-16 ***
## P.E.Ratio                           1.294e-01  5.176e-02   2.500 0.012801 *  
## P.B.Ratio                           1.854e+00  3.958e-01   4.684 3.81e-06 ***
## Avg.Volume                          2.703e-09  6.574e-07   0.004 0.996721    
## Net.Income                          7.002e-10  1.833e-09   0.382 0.702633    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.44 on 417 degrees of freedom
## Multiple R-squared:  0.6862, Adjusted R-squared:  0.6802 
## F-statistic:   114 on 8 and 417 DF,  p-value: < 2.2e-16

From the initial model we can eliminate Avg.Volume and Net.Income, as both have p-values above 0.05. The R^2 is already quite good for stock prediction, however, and we will still test the model with Net.Income_log before dropping the variable entirely.

### VIF testing

vif_values <- vif(initial_model)
print(vif_values)
##                               GVIF Df GVIF^(1/(2*Df))
## Market.Cap.Classification 1.386469  2        1.085119
## Growth.vs.Value           1.195510  1        1.093394
## EPS                       1.298475  1        1.139507
## P.E.Ratio                 1.095301  1        1.046566
## P.B.Ratio                 1.097711  1        1.047717
## Avg.Volume                1.286456  1        1.134220
## Net.Income                1.429539  1        1.195633
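For multi-coefficient terms such as factors, car::vif() reports a generalized VIF; the GVIF^(1/(2*Df)) column is the size-adjusted value, and squaring it puts every term on the same scale as the usual VIF rules of thumb (e.g. flag values above 5 or 10). A quick check using values copied from the output above:

```r
# Squaring GVIF^(1/(2*Df)) makes all terms comparable to ordinary VIF
# thresholds; the numbers below are copied from the vif() output above.
gvif_adj <- c(Market.Cap.Classification = 1.085119,
              Growth.vs.Value           = 1.093394,
              EPS                       = 1.139507,
              Net.Income                = 1.195633)
gvif_adj^2  # all well under 5, so no multicollinearity concern
```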

Nothing shows high multicollinearity, which matches the results from the correlation matrix.

## Testing initial model with Net.Income_log and without Avg.Volume

quantitative_vars <- c("EPS", "P.E.Ratio", "P.B.Ratio", "Net.Income_log")
selected_vars <- c(factor_vars, quantitative_vars)
formula <- as.formula(paste("Recent.Close.Price ~", paste(selected_vars, collapse = " + ")))
model_2 <- lm(formula, data = model_data)
par(mfrow = c(2, 2))  # Reset plotting area
plot(model_2, main = "Model 2")

summary(model_2)
## 
## Call:
## lm(formula = formula, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -383.87  -33.14  -12.11   23.63  702.81 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          5.87635  100.23757   0.059   0.9533    
## Market.Cap.ClassificationMid Cap   -29.36052   12.37351  -2.373   0.0181 *  
## Market.Cap.ClassificationSmall Cap -37.05171   20.28498  -1.827   0.0685 .  
## Growth.vs.ValueValue               -94.27068   11.05572  -8.527 2.78e-16 ***
## EPS                                 17.45541    0.71415  24.442  < 2e-16 ***
## P.E.Ratio                            0.14283    0.05757   2.481   0.0135 *  
## P.B.Ratio                            1.84279    0.39636   4.649 4.47e-06 ***
## Net.Income_log                       2.58590    4.83308   0.535   0.5929    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.34 on 418 degrees of freedom
## Multiple R-squared:  0.6863, Adjusted R-squared:  0.6811 
## F-statistic: 130.7 on 7 and 418 DF,  p-value: < 2.2e-16
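The "does not change R^2 by a significant amount" judgment can be made formal with a partial F-test on nested models via anova(). A sketch on simulated stand-in data (the project's CSV is not embedded here, so the variables and numbers are illustrative only):

```r
# Partial F-test: does adding a predictor significantly improve the fit?
# Simulated data in which the extra predictor has no true effect.
set.seed(42)
n     <- 400
eps   <- rnorm(n, 5, 2)   # stands in for EPS
extra <- rnorm(n)         # stands in for Net.Income_log (no true effect)
price <- 100 + 17 * eps + rnorm(n, 0, 80)

reduced <- lm(price ~ eps)
full    <- lm(price ~ eps + extra)

# anova() on nested models performs the partial F-test; a large p-value
# means the extra predictor can be dropped.
anova(reduced, full)[2, "Pr(>F)"]
```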

After testing this model we can see that Net.Income_log is still not statistically significant, and including it does not change the R^2 value by a meaningful amount, so I will be excluding both Net.Income_log and Net.Income from future models.

## Testing Model 3 with only 3 quantitative variables

model_data$Avg.Volume <- NULL
model_data$Net.Income <- NULL
model_data$Net.Income_log <- NULL
quantitative_vars <- c("EPS", "P.E.Ratio", "P.B.Ratio")
selected_vars <- c(factor_vars, quantitative_vars)
formula <- as.formula(paste("Recent.Close.Price ~", paste(selected_vars, collapse = " + ")))
model_3 <- lm(formula, data = model_data)
par(mfrow = c(2, 2))  # Reset plotting area
plot(model_3, main = "Model 3")

summary(model_3)
## 
## Call:
## lm(formula = formula, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -385.55  -31.80  -12.47   23.44  703.15 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         59.30879    8.61666   6.883 2.15e-11 ***
## Market.Cap.ClassificationMid Cap   -33.90583    8.98870  -3.772 0.000185 ***
## Market.Cap.ClassificationSmall Cap -45.96666   11.55963  -3.976 8.24e-05 ***
## Growth.vs.ValueValue               -92.49990   10.53972  -8.776  < 2e-16 ***
## EPS                                 17.55041    0.69114  25.394  < 2e-16 ***
## P.E.Ratio                            0.12871    0.05112   2.518 0.012178 *  
## P.B.Ratio                            1.86534    0.39378   4.737 2.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.27 on 419 degrees of freedom
## Multiple R-squared:  0.6861, Adjusted R-squared:  0.6816 
## F-statistic: 152.6 on 6 and 419 DF,  p-value: < 2.2e-16

So far every one of these variables is statistically significant, so we will proceed with this model for further analysis. We will also check it with the Ramsey RESET test for model misspecification, having omitted the two other variables from future consideration and testing.

## Testing Model Misspecification

resettest(model_3, power = 2, type = "regressor")
## 
##  RESET test
## 
## data:  model_3
## RESET = 16.289, df1 = 3, df2 = 416, p-value = 4.998e-10
resettest(model_3, power = 3, type = "regressor")
## 
##  RESET test
## 
## data:  model_3
## RESET = 8.3554, df1 = 3, df2 = 416, p-value = 2.092e-05
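What resettest() does can be sketched by hand: augment the model with higher powers (here of the fitted values, one common variant; type = "regressor" above uses powers of the regressors instead) and F-test their joint significance. On simulated data with a genuine quadratic term the test rejects linearity, just as it does above:

```r
# RESET by hand (fitted-values variant): refit with yhat^2 and yhat^3
# added, then F-test whether they jointly matter. Simulated data with a
# real quadratic term, so the linear specification should be rejected.
set.seed(7)
n <- 300
x <- runif(n, 0, 10)
y <- 2 + 3 * x + 0.5 * x^2 + rnorm(n)

base <- lm(y ~ x)
yhat <- fitted(base)
aug  <- lm(y ~ x + I(yhat^2) + I(yhat^3))

pval <- anova(base, aug)[2, "Pr(>F)"]
pval < 0.05  # TRUE: nonlinearity detected
```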

It appears that we need a non-linear transformation for at least one of the quantitative variables, so we will try that in the next section.

### Testing each of the selected variables to see if a transformation is needed

for (var in quantitative_vars) {
  # Define formulas
  formula_test_square <- as.formula(paste("Recent.Close.Price ~", 
                                          paste(selected_vars, collapse = " + "), 
                                          "+ I(", var, "^2)"))
  formula_test_both <- as.formula(paste("Recent.Close.Price ~", 
                                        paste(selected_vars, collapse = " + "), 
                                        "+ I(", var, "^2) + I(", var, "^3)"))
  formula_test_cube <- as.formula(paste("Recent.Close.Price ~", 
                                        paste(selected_vars, collapse = " + "), 
                                        "+ I(", var, "^3)"))
  formula_test_no_linear <- as.formula(paste("Recent.Close.Price ~", 
                                             paste(selected_vars, collapse = " + "), 
                                             "+ I(", var, "^2) + I(", var, "^3) -", var))
  
  # Fit models
  model_test_square <- lm(formula_test_square, data = model_data)
  model_test_both <- lm(formula_test_both, data = model_data)
  model_test_cube <- lm(formula_test_cube, data = model_data)
  model_test_no_linear <- lm(formula_test_no_linear, data = model_data)
  
  # Print BIC values
  print(paste("Testing non-linearity for variable:", var))
  print(BIC(model_3, model_test_square, model_test_cube, model_test_both, model_test_no_linear))
  
  # Extract R-squared and Adjusted R-squared values
  r_squared_values <- c(
    summary(model_test_square)$r.squared,
    summary(model_test_cube)$r.squared,
    summary(model_test_both)$r.squared,
    summary(model_test_no_linear)$r.squared
  )
  
  adj_r_squared_values <- c(
    summary(model_test_square)$adj.r.squared,
    summary(model_test_cube)$adj.r.squared,
    summary(model_test_both)$adj.r.squared,
    summary(model_test_no_linear)$adj.r.squared
  )
  
  # Print R-squared and Adjusted R-squared values
  print(paste("R-squared values for models with", var, ":"))
  print(r_squared_values)
  
  print(paste("Adjusted R-squared values for models with", var, ":"))
  print(adj_r_squared_values)
}
## [1] "Testing non-linearity for variable: EPS"
##                      df      BIC
## model_3               8 4986.635
## model_test_square     9 4979.291
## model_test_cube       9 4986.125
## model_test_both      10 4966.657
## model_test_no_linear  9 5080.614
## [1] "R-squared values for models with EPS :"
## [1] 0.6958361 0.6909166 0.7088909 0.6141627
## [1] "Adjusted R-squared values for models with EPS :"
## [1] 0.6907424 0.6857406 0.7033061 0.6077013
## [1] "Testing non-linearity for variable: P.E.Ratio"
##                      df      BIC
## model_3               8 4986.635
## model_test_square     9 4981.593
## model_test_cube       9 4986.910
## model_test_both      10 4963.063
## model_test_no_linear  9 4997.502
## [1] "R-squared values for models with P.E.Ratio :"
## [1] 0.6941874 0.6903470 0.7113365 0.6825509
## [1] "Adjusted R-squared values for models with P.E.Ratio :"
## [1] 0.6890662 0.6851614 0.7057986 0.6772347
## [1] "Testing non-linearity for variable: P.B.Ratio"
##                      df      BIC
## model_3               8 4986.635
## model_test_square     9 4971.871
## model_test_cube       9 4980.229
## model_test_both      10 4959.054
## model_test_no_linear  9 5003.738
## [1] "R-squared values for models with P.B.Ratio :"
## [1] 0.7010876 0.6951656 0.7140402 0.6778703
## [1] "Adjusted R-squared values for models with P.B.Ratio :"
## [1] 0.6960819 0.6900607 0.7085542 0.6724757

Testing square and cube powers of the quantitative variables suggests that, for each variable, we would prefer a model that includes linear, quadratic, and cubic terms. We now have BIC and adjusted R^2 values for each individual model, so next we will combine the transformations to find the best overall specification. So far the best model by BIC is model_test_both from the P.B.Ratio iteration (quadratic and cubic terms added for P.B.Ratio only), with a BIC of 4959.054 and an adjusted R^2 of 0.7085542; all further models will be tested against it. We are using BIC instead of AIC because we want to select the best possible model for our current data rather than optimize for forecasting; we will use AIC at the end to confirm our selections.
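For reference, BIC charges log(n) per estimated parameter while AIC charges 2, which is exactly why BIC leans toward smaller models on samples of this size. The value R reports for an lm fit can be reproduced by hand (shown on the built-in mtcars data, not the stock dataset):

```r
# BIC = -2 * logLik + k * log(n), where k counts every estimated
# parameter: the coefficients plus the residual variance.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nrow(mtcars)
k   <- length(coef(fit)) + 1          # +1 for the error variance
manual_bic <- -2 * as.numeric(logLik(fit)) + k * log(n)

all.equal(manual_bic, BIC(fit))       # TRUE
# AIC swaps log(n) for 2, so BIC penalizes harder whenever n > e^2 ~ 7.4
```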
### Testing possible combinations to find the best model through BIC and adjusted R^2

# Function to fit models and compare adjusted R-squared values
compare_models <- function(data, dependent_var, pairs) {
  results <- list()
  for (pair in pairs) {
    var1 <- pair[1]
    var2 <- pair[2]
    
    # Define formulas
    formula_test <- as.formula(paste(dependent_var, "~", 
                                     paste(selected_vars, collapse = " + "), 
                                     "+ I(", var1, "^2) + I(", var1, "^3) + I(", var2, "^2) + I(", var2, "^3)"))
    
    # fit our model
    model_test <- lm(formula_test, data = data)
    
    
    adj_r_squared <- summary(model_test)$adj.r.squared
    bic_result  <- BIC(model_test)
    
    # store results
    results[[paste(var1, var2, sep = "_")]] <- list(adj_r_squared = adj_r_squared, BIC = bic_result)
  }
  
  # now doing for all three
  var1 <- quantitative_vars[1]
  var2 <- quantitative_vars[2]
  var3 <- quantitative_vars[3]
  formula_test <- as.formula(paste("Recent.Close.Price ~", 
                                 paste(selected_vars, collapse = " + "), 
                                 "+ I(", var1, "^2) + I(", var1, "^3) + I(", var2, "^2) + I(", var2, "^3) + I(", var3, "^2) + I(", var3,
                                 "^3)"))
 
  model_test <- lm(formula_test, data = data)
  adj_r_squared <- summary(model_test)$adj.r.squared
  bic_result <- BIC(model_test)
    
  # store results
  results[[paste(var1, var2, var3, sep = "_")]] <- list(adj_r_squared = adj_r_squared, BIC = bic_result)
    
  return(results)
}

dependent_var <- "Recent.Close.Price"
# combn() generates all unordered pairs (combinations) of the quantitative variables
pairs <- combn(quantitative_vars, 2, simplify = FALSE)
# run the function
results <- compare_models(model_data, dependent_var, pairs)

# Print results
print("Adjusted R-squared  and BIC values for all pairs:")
## [1] "Adjusted R-squared  and BIC values for all pairs:"
print(results)
## $EPS_P.E.Ratio
## $EPS_P.E.Ratio$adj_r_squared
## [1] 0.7427126
## 
## $EPS_P.E.Ratio$BIC
## [1] 4916.01
## 
## 
## $EPS_P.B.Ratio
## $EPS_P.B.Ratio$adj_r_squared
## [1] 0.7276459
## 
## $EPS_P.B.Ratio$BIC
## [1] 4940.253
## 
## 
## $P.E.Ratio_P.B.Ratio
## $P.E.Ratio_P.B.Ratio$adj_r_squared
## [1] 0.7276538
## 
## $P.E.Ratio_P.B.Ratio$BIC
## [1] 4940.241
## 
## 
## $EPS_P.E.Ratio_P.B.Ratio
## $EPS_P.E.Ratio_P.B.Ratio$adj_r_squared
## [1] 0.7597781
## 
## $EPS_P.E.Ratio_P.B.Ratio$BIC
## [1] 4896.824
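One caveat with raw x, x^2, x^3 columns is that they are highly correlated with one another, which inflates the standard errors of the individual polynomial coefficients even though the overall fit is unaffected. R's poly() builds an orthogonal basis spanning the same cubic fit; a sketch on simulated data (an alternative worth noting, not what this report uses):

```r
# Raw powers of a positive variable are nearly collinear; poly() spans
# the same cubic model with mutually orthogonal columns.
set.seed(3)
x <- runif(200, 1, 50)               # e.g. a P/E-like range
y <- 5 + 0.4 * x - 0.01 * x^2 + rnorm(200)

cor(x, x^2)                          # very high

raw  <- lm(y ~ x + I(x^2) + I(x^3))
orth <- lm(y ~ poly(x, 3))

# Same fitted values: identical model, better-conditioned design matrix.
all.equal(fitted(raw), fitted(orth))  # TRUE
```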

After testing all possible combinations, the model with squared and cubed terms for all three variables has the highest adjusted R^2 and the lowest BIC, so we will proceed with it for now. My one worry is that I am artificially inflating R^2 by adding more variables, but since I am using adjusted R^2 and BIC, both of which penalize model size, I think it is fine.

### Summary of Model after testing for model misspecification

poly_terms <- c()
for (var in quantitative_vars) {
  poly_terms <- c(poly_terms, paste0("I(", var, "^2)"), paste0("I(", var, "^3)"))
}
selected_vars <- c(quantitative_vars, factor_vars, poly_terms)

formula_test <- as.formula(paste("Recent.Close.Price ~", paste(selected_vars, collapse = " + ")))
      

model_4 <- lm(formula_test, data = model_data)
par(mfrow = c(2, 2))  # Reset plotting area
plot(model_4, main = "Model 4")
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

print(BIC(model_4))
## [1] 4896.824
summary(model_4)
## 
## Call:
## lm(formula = formula_test, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -265.26  -27.65   -3.01   18.83  600.15 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -7.902e+01  1.401e+01  -5.639 3.18e-08 ***
## EPS                                 3.739e+01  2.686e+00  13.920  < 2e-16 ***
## P.E.Ratio                           1.738e+00  2.115e-01   8.218 2.70e-15 ***
## P.B.Ratio                           8.289e+00  1.325e+00   6.256 9.86e-10 ***
## Market.Cap.ClassificationMid Cap   -1.703e+01  8.067e+00  -2.111 0.035376 *  
## Market.Cap.ClassificationSmall Cap  3.121e+00  1.112e+01   0.281 0.779173    
## Growth.vs.ValueValue               -6.494e+01  1.009e+01  -6.433 3.47e-10 ***
## I(EPS^2)                           -1.170e+00  1.698e-01  -6.893 2.05e-11 ***
## I(EPS^3)                            1.598e-02  2.677e-03   5.969 5.13e-09 ***
## I(P.E.Ratio^2)                     -4.946e-03  7.458e-04  -6.631 1.04e-10 ***
## I(P.E.Ratio^3)                      3.388e-06  5.788e-07   5.853 9.84e-09 ***
## I(P.B.Ratio^2)                     -1.563e-01  3.375e-02  -4.632 4.86e-06 ***
## I(P.B.Ratio^3)                      7.375e-04  1.917e-04   3.847 0.000139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.72 on 413 degrees of freedom
## Multiple R-squared:  0.7666, Adjusted R-squared:  0.7598 
## F-statistic:   113 on 12 and 413 DF,  p-value: < 2.2e-16

As of now this appears to be our best model; although it reduces the importance of the market cap classification, it is still the highest-R^2 model we have, with an adjusted R^2 of 0.7597781 and a BIC of 4896.824. We will proceed with it as our fourth candidate model and test whether it needs interaction terms.

## Testing for Need of interaction terms
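Before the search below, note why interactions between EPS and P.E.Ratio must be excluded: P/E is defined as Price / EPS, so their product reproduces the share price itself and would leak the target into the predictors. A check using XOM's row from the influential-outlier table above:

```r
# P/E = Price / EPS, so EPS * (P/E) is the share price by construction.
eps   <- 8.03       # XOM EPS from the outlier table
pe    <- 14.68991   # XOM P.E.Ratio
price <- 117.96     # XOM Recent.Close.Price

round(eps * pe, 2) == price  # TRUE
```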

# Function to test if interaction terms are needed
test_interaction_terms <- function(data, dependent_var, selected_vars) {
  results <- data.frame(
    model = character(),
    var1 = character(),
    var2 = character(),
    adj_r_squared = numeric(),
    AIC = numeric(),
    BIC = numeric(),
    stringsAsFactors = FALSE
  )
  
  # Define base model formula
  base_formula <- as.formula(paste(dependent_var, "~", paste(selected_vars, collapse = " + ")))
  
  # Fit base model
  base_model <- lm(base_formula, data = data)
  
  # Extract adjusted R-squared, AIC, and BIC for base model
  base_adj_r_squared <- summary(base_model)$adj.r.squared
  base_aic <- AIC(base_model)
  base_bic <- BIC(base_model)
  
  # Store base model results
  results <- rbind(results, data.frame(
    model = "base_model",
    var1 = NA,
    var2 = NA,
    adj_r_squared = base_adj_r_squared,
    AIC = base_aic,
    BIC = base_bic,
    stringsAsFactors = FALSE
  ))
  
  # Generate all possible pairs of selected variables for interaction terms
  pairs <- combn(selected_vars, 2, simplify = FALSE)
  
  for (pair in pairs) {
    # sort the pair so (A,B) and (B,A) are treated as the same model; avoids duplicate rows (not the fastest approach, but simple)
    pair <- sort(pair)
    var1 <- pair[1]
    var2 <- pair[2]
    
    # Define formula with interaction term
    interaction_formula <- as.formula(paste(dependent_var, "~", 
                                            paste(selected_vars, collapse = " + "), 
                                            "+", paste(var1, "*", var2)))
    
    # Fit model with interaction term
    interaction_model <- lm(interaction_formula, data = data)
    
    # Extract adjusted R-squared, AIC, and BIC for interaction model
    interaction_adj_r_squared <- summary(interaction_model)$adj.r.squared
    interaction_aic <- AIC(interaction_model)
    interaction_bic <- BIC(interaction_model)
    
    # Store interaction model results
    results <- rbind(results, data.frame(
      model = paste(var1, var2, sep = "_"),
      var1 = var1,
      var2 = var2,
      adj_r_squared = interaction_adj_r_squared,
      AIC = interaction_aic,
      BIC = interaction_bic,
      stringsAsFactors = FALSE
    ))
  }
  
  return(results)
}

# Example usage
dependent_var <- "Recent.Close.Price"

# Test for interaction terms
interaction_results <- test_interaction_terms(model_data, dependent_var, selected_vars)
interaction_results <- sort_by(interaction_results, interaction_results$BIC)

# Remove every interaction involving both EPS and P.E.Ratio: since P/E = Price / EPS, their product is just the share price, so such a term would leak the target into the predictors and make the model useless.
interaction_results <- helper.remove_specific_rows(interaction_results, "EPS", "P.E.Ratio")
# Print results
print("Model comparison results with interaction terms:")
## [1] "Model comparison results with interaction terms:"
print(interaction_results)
##                                        model                      var1
## 5                        EPS_Growth.vs.Value                       EPS
## 23                        I(EPS^2)_P.B.Ratio                  I(EPS^2)
## 24                        I(EPS^3)_P.B.Ratio                  I(EPS^3)
## 3                              EPS_P.B.Ratio                       EPS
## 36                  Growth.vs.Value_I(EPS^2)           Growth.vs.Value
## 49                   I(EPS^3)_I(P.B.Ratio^2)                  I(EPS^3)
## 31        I(EPS^3)_Market.Cap.Classification                  I(EPS^3)
## 37                  Growth.vs.Value_I(EPS^3)           Growth.vs.Value
## 30        I(EPS^2)_Market.Cap.Classification                  I(EPS^2)
## 18                  I(P.E.Ratio^3)_P.E.Ratio            I(P.E.Ratio^3)
## 4              EPS_Market.Cap.Classification                       EPS
## 51             I(P.E.Ratio^2)_I(P.E.Ratio^3)            I(P.E.Ratio^2)
## 45                   I(EPS^2)_I(P.B.Ratio^2)                  I(EPS^2)
## 21       Market.Cap.Classification_P.B.Ratio Market.Cap.Classification
## 13       Market.Cap.Classification_P.E.Ratio Market.Cap.Classification
## 29 Growth.vs.Value_Market.Cap.Classification           Growth.vs.Value
## 10                        EPS_I(P.B.Ratio^2)                       EPS
## 50                   I(EPS^3)_I(P.B.Ratio^3)                  I(EPS^3)
## 32  I(P.E.Ratio^2)_Market.Cap.Classification            I(P.E.Ratio^2)
## 22                 Growth.vs.Value_P.B.Ratio           Growth.vs.Value
## 34  I(P.B.Ratio^2)_Market.Cap.Classification            I(P.B.Ratio^2)
## 1                                 base_model                      <NA>
## 6                               EPS_I(EPS^2)                       EPS
## 17                  I(P.E.Ratio^2)_P.E.Ratio            I(P.E.Ratio^2)
## 27                  I(P.B.Ratio^2)_P.B.Ratio            I(P.B.Ratio^2)
## 33  I(P.E.Ratio^3)_Market.Cap.Classification            I(P.E.Ratio^3)
## 55             I(P.B.Ratio^3)_I(P.E.Ratio^3)            I(P.B.Ratio^3)
## 54             I(P.B.Ratio^2)_I(P.E.Ratio^3)            I(P.B.Ratio^2)
## 26                  I(P.E.Ratio^3)_P.B.Ratio            I(P.E.Ratio^3)
## 52             I(P.B.Ratio^2)_I(P.E.Ratio^2)            I(P.B.Ratio^2)
## 53             I(P.B.Ratio^3)_I(P.E.Ratio^2)            I(P.B.Ratio^3)
## 46                   I(EPS^2)_I(P.B.Ratio^3)                  I(EPS^2)
## 25                  I(P.E.Ratio^2)_P.B.Ratio            I(P.E.Ratio^2)
## 35  I(P.B.Ratio^3)_Market.Cap.Classification            I(P.B.Ratio^3)
## 40            Growth.vs.Value_I(P.B.Ratio^2)           Growth.vs.Value
## 39            Growth.vs.Value_I(P.E.Ratio^3)           Growth.vs.Value
## 19                  I(P.B.Ratio^2)_P.E.Ratio            I(P.B.Ratio^2)
## 38            Growth.vs.Value_I(P.E.Ratio^2)           Growth.vs.Value
## 11                        EPS_I(P.B.Ratio^3)                       EPS
## 42                         I(EPS^2)_I(EPS^3)                  I(EPS^2)
## 20                  I(P.B.Ratio^3)_P.E.Ratio            I(P.B.Ratio^3)
## 56             I(P.B.Ratio^2)_I(P.B.Ratio^3)            I(P.B.Ratio^2)
## 7                               EPS_I(EPS^3)                       EPS
## 28                  I(P.B.Ratio^3)_P.B.Ratio            I(P.B.Ratio^3)
## 14                 Growth.vs.Value_P.E.Ratio           Growth.vs.Value
## 41            Growth.vs.Value_I(P.B.Ratio^3)           Growth.vs.Value
## 12                       P.B.Ratio_P.E.Ratio                 P.B.Ratio
##                         var2 adj_r_squared      AIC      BIC
## 5            Growth.vs.Value     0.8164550 4726.391 4787.208
## 23                 P.B.Ratio     0.8105643 4739.848 4800.665
## 24                 P.B.Ratio     0.8082636 4744.991 4805.808
## 3                  P.B.Ratio     0.7969476 4769.419 4830.235
## 36                  I(EPS^2)     0.7948698 4773.756 4834.573
## 49            I(P.B.Ratio^2)     0.7888001 4786.178 4846.995
## 31 Market.Cap.Classification     0.7830120 4798.661 4863.532
## 37                  I(EPS^3)     0.7795155 4804.506 4865.322
## 30 Market.Cap.Classification     0.7819840 4800.674 4865.545
## 18                 P.E.Ratio     0.7747108 4813.689 4874.506
## 4  Market.Cap.Classification     0.7765463 4811.169 4876.040
## 51            I(P.E.Ratio^3)     0.7725229 4817.806 4878.623
## 45            I(P.B.Ratio^2)     0.7716715 4819.398 4880.214
## 21                 P.B.Ratio     0.7701138 4823.259 4888.130
## 13                 P.E.Ratio     0.7700402 4823.395 4888.266
## 29 Market.Cap.Classification     0.7677172 4827.677 4892.548
## 10            I(P.B.Ratio^2)     0.7642560 4833.013 4893.830
## 50            I(P.B.Ratio^3)     0.7638947 4833.666 4894.482
## 32 Market.Cap.Classification     0.7661946 4830.460 4895.331
## 22                 P.B.Ratio     0.7634125 4834.535 4895.351
## 34 Market.Cap.Classification     0.7660776 4830.673 4895.544
## 1                       <NA>     0.7597781 4840.062 4896.824
## 6                   I(EPS^2)     0.7597781 4840.062 4896.824
## 17                 P.E.Ratio     0.7597781 4840.062 4896.824
## 27                 P.B.Ratio     0.7597781 4840.062 4896.824
## 33 Market.Cap.Classification     0.7653682 4831.963 4896.835
## 55            I(P.E.Ratio^3)     0.7619569 4837.148 4897.964
## 54            I(P.E.Ratio^3)     0.7619561 4837.149 4897.966
## 26                 P.B.Ratio     0.7616645 4837.671 4898.487
## 52            I(P.E.Ratio^2)     0.7614559 4838.043 4898.860
## 53            I(P.E.Ratio^2)     0.7613985 4838.146 4898.962
## 46            I(P.B.Ratio^3)     0.7609455 4838.954 4899.770
## 25                 P.B.Ratio     0.7608145 4839.187 4900.004
## 35 Market.Cap.Classification     0.7634789 4835.380 4900.251
## 40            I(P.B.Ratio^2)     0.7605664 4839.629 4900.445
## 39            I(P.E.Ratio^3)     0.7602069 4840.268 4901.085
## 19                 P.E.Ratio     0.7600302 4840.582 4901.398
## 38            I(P.E.Ratio^2)     0.7600216 4840.597 4901.414
## 11            I(P.B.Ratio^3)     0.7599919 4840.650 4901.466
## 42                  I(EPS^3)     0.7598407 4840.918 4901.735
## 20                 P.E.Ratio     0.7598363 4840.926 4901.742
## 56            I(P.B.Ratio^3)     0.7598001 4840.990 4901.807
## 7                   I(EPS^3)     0.7597492 4841.080 4901.897
## 28                 P.B.Ratio     0.7597248 4841.124 4901.940
## 14                 P.E.Ratio     0.7596689 4841.223 4902.039
## 41            I(P.B.Ratio^3)     0.7596402 4841.274 4902.090
## 12                 P.E.Ratio     0.7594031 4841.693 4902.510

I could potentially extend this search to three-way interactions or test multiple two-way interactions at once, but that approach quickly becomes open-ended, so I will proceed with the top model (model_5) as my best model for now. I am worried about overfitting when adding multiple two-way interactions and the extra complexity they bring. Three-way interaction terms are also something I have never encountered in all my years of doing stats, so I will not pursue that approach.
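A bounded alternative to an open-ended search is to fit each candidate pairwise interaction exactly once against the base model and keep only pairs that improve BIC. This is only a sketch, assuming the `model_data`, `dependent_var`, and `selected_vars` objects from the earlier chunks:

```r
# Sketch: screen every two-way interaction once, keeping pairs that beat the
# base model's BIC. Assumes model_data, dependent_var, selected_vars exist.
base_formula <- as.formula(paste(dependent_var, "~",
                                 paste(selected_vars, collapse = " + ")))
base_bic <- BIC(lm(base_formula, data = model_data))

var_pairs <- combn(selected_vars, 2, simplify = FALSE)
screen <- lapply(var_pairs, function(p) {
  f <- as.formula(paste(dependent_var, "~",
                        paste(selected_vars, collapse = " + "),
                        "+", p[1], ":", p[2]))
  data.frame(interaction = paste(p, collapse = ":"),
             BIC = BIC(lm(f, data = model_data)))
})
screen_df <- do.call(rbind, screen)
subset(screen_df[order(screen_df$BIC), ], BIC < base_bic)  # improving pairs only
```

Because each candidate is tested once against a fixed baseline, the search terminates after `choose(length(selected_vars), 2)` fits rather than recursing.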

Finalizing Model 5

interaction_formula <- as.formula(paste(
  dependent_var, "~",
  paste(selected_vars, collapse = " + "),
  "+ EPS * Growth.vs.Value"
))
model_5 <- lm(interaction_formula, data = model_data)
par(mfrow = c(2, 2))  # Arrange the four diagnostic plots in a 2x2 grid
plot(model_5, main = "Model 5")
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

print(BIC(model_5))
## [1] 4787.208
summary(model_5)
## 
## Call:
## lm(formula = interaction_formula, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -174.16  -20.52    1.16   17.95  579.78 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -8.159e+01  1.225e+01  -6.660 8.81e-11 ***
## EPS                                 3.361e+01  2.372e+00  14.173  < 2e-16 ***
## P.E.Ratio                           1.901e+00  1.854e-01  10.253  < 2e-16 ***
## P.B.Ratio                           7.069e+00  1.163e+00   6.078 2.77e-09 ***
## Market.Cap.ClassificationMid Cap   -1.294e+01  7.060e+00  -1.833 0.067482 .  
## Market.Cap.ClassificationSmall Cap -6.208e+00  9.757e+00  -0.636 0.524986    
## Growth.vs.ValueValue                2.598e+01  1.192e+01   2.179 0.029927 *  
## I(EPS^2)                           -5.683e-01  1.576e-01  -3.605 0.000351 ***
## I(EPS^3)                            6.707e-03  2.479e-03   2.706 0.007097 ** 
## I(P.E.Ratio^2)                     -5.419e-03  6.533e-04  -8.295 1.56e-15 ***
## I(P.E.Ratio^3)                      3.705e-06  5.067e-07   7.313 1.38e-12 ***
## I(P.B.Ratio^2)                     -1.355e-01  2.956e-02  -4.585 6.02e-06 ***
## I(P.B.Ratio^3)                      6.500e-04  1.677e-04   3.875 0.000124 ***
## EPS:Growth.vs.ValueValue           -1.265e+01  1.116e+00 -11.337  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.94 on 412 degrees of freedom
## Multiple R-squared:  0.8221, Adjusted R-squared:  0.8165 
## F-statistic: 146.4 on 13 and 412 DF,  p-value: < 2.2e-16

As of now this appears to be our best model: although it reduces the importance of the Market.Cap.Classification terms, it still has the highest R^2 of any model so far. We will proceed with it as our fifth candidate model and now perform all comparison tests on our 5 potential models (initial_model, model_2, model_3, model_4, and model_5).

## Diagnostic Plots of our 5 models

helper.diagnostic_plots(initial_model)
summary(initial_model)
## 
## Call:
## lm(formula = formula, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -384.95  -32.71  -12.58   23.86  703.95 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         5.813e+01  9.277e+00   6.266 9.22e-10 ***
## Market.Cap.ClassificationMid Cap   -3.263e+01  9.585e+00  -3.404 0.000728 ***
## Market.Cap.ClassificationSmall Cap -4.458e+01  1.217e+01  -3.662 0.000282 ***
## Growth.vs.ValueValue               -9.309e+01  1.066e+01  -8.730  < 2e-16 ***
## EPS                                 1.752e+01  7.130e-01  24.569  < 2e-16 ***
## P.E.Ratio                           1.294e-01  5.176e-02   2.500 0.012801 *  
## P.B.Ratio                           1.854e+00  3.958e-01   4.684 3.81e-06 ***
## Avg.Volume                          2.703e-09  6.574e-07   0.004 0.996721    
## Net.Income                          7.002e-10  1.833e-09   0.382 0.702633    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.44 on 417 degrees of freedom
## Multiple R-squared:  0.6862, Adjusted R-squared:  0.6802 
## F-statistic:   114 on 8 and 417 DF,  p-value: < 2.2e-16

helper.diagnostic_plots(model_2)
summary(model_2)
## 
## Call:
## lm(formula = formula, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -383.87  -33.14  -12.11   23.63  702.81 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          5.87635  100.23757   0.059   0.9533    
## Market.Cap.ClassificationMid Cap   -29.36052   12.37351  -2.373   0.0181 *  
## Market.Cap.ClassificationSmall Cap -37.05171   20.28498  -1.827   0.0685 .  
## Growth.vs.ValueValue               -94.27068   11.05572  -8.527 2.78e-16 ***
## EPS                                 17.45541    0.71415  24.442  < 2e-16 ***
## P.E.Ratio                            0.14283    0.05757   2.481   0.0135 *  
## P.B.Ratio                            1.84279    0.39636   4.649 4.47e-06 ***
## Net.Income_log                       2.58590    4.83308   0.535   0.5929    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.34 on 418 degrees of freedom
## Multiple R-squared:  0.6863, Adjusted R-squared:  0.6811 
## F-statistic: 130.7 on 7 and 418 DF,  p-value: < 2.2e-16

helper.diagnostic_plots(model_3)
summary(model_3)
## 
## Call:
## lm(formula = formula, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -385.55  -31.80  -12.47   23.44  703.15 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         59.30879    8.61666   6.883 2.15e-11 ***
## Market.Cap.ClassificationMid Cap   -33.90583    8.98870  -3.772 0.000185 ***
## Market.Cap.ClassificationSmall Cap -45.96666   11.55963  -3.976 8.24e-05 ***
## Growth.vs.ValueValue               -92.49990   10.53972  -8.776  < 2e-16 ***
## EPS                                 17.55041    0.69114  25.394  < 2e-16 ***
## P.E.Ratio                            0.12871    0.05112   2.518 0.012178 *  
## P.B.Ratio                            1.86534    0.39378   4.737 2.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.27 on 419 degrees of freedom
## Multiple R-squared:  0.6861, Adjusted R-squared:  0.6816 
## F-statistic: 152.6 on 6 and 419 DF,  p-value: < 2.2e-16

helper.diagnostic_plots(model_4)
summary(model_4)
## 
## Call:
## lm(formula = formula_test, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -265.26  -27.65   -3.01   18.83  600.15 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -7.902e+01  1.401e+01  -5.639 3.18e-08 ***
## EPS                                 3.739e+01  2.686e+00  13.920  < 2e-16 ***
## P.E.Ratio                           1.738e+00  2.115e-01   8.218 2.70e-15 ***
## P.B.Ratio                           8.289e+00  1.325e+00   6.256 9.86e-10 ***
## Market.Cap.ClassificationMid Cap   -1.703e+01  8.067e+00  -2.111 0.035376 *  
## Market.Cap.ClassificationSmall Cap  3.121e+00  1.112e+01   0.281 0.779173    
## Growth.vs.ValueValue               -6.494e+01  1.009e+01  -6.433 3.47e-10 ***
## I(EPS^2)                           -1.170e+00  1.698e-01  -6.893 2.05e-11 ***
## I(EPS^3)                            1.598e-02  2.677e-03   5.969 5.13e-09 ***
## I(P.E.Ratio^2)                     -4.946e-03  7.458e-04  -6.631 1.04e-10 ***
## I(P.E.Ratio^3)                      3.388e-06  5.788e-07   5.853 9.84e-09 ***
## I(P.B.Ratio^2)                     -1.563e-01  3.375e-02  -4.632 4.86e-06 ***
## I(P.B.Ratio^3)                      7.375e-04  1.917e-04   3.847 0.000139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.72 on 413 degrees of freedom
## Multiple R-squared:  0.7666, Adjusted R-squared:  0.7598 
## F-statistic:   113 on 12 and 413 DF,  p-value: < 2.2e-16

helper.diagnostic_plots(model_5)
summary(model_5)
## 
## Call:
## lm(formula = interaction_formula, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -174.16  -20.52    1.16   17.95  579.78 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -8.159e+01  1.225e+01  -6.660 8.81e-11 ***
## EPS                                 3.361e+01  2.372e+00  14.173  < 2e-16 ***
## P.E.Ratio                           1.901e+00  1.854e-01  10.253  < 2e-16 ***
## P.B.Ratio                           7.069e+00  1.163e+00   6.078 2.77e-09 ***
## Market.Cap.ClassificationMid Cap   -1.294e+01  7.060e+00  -1.833 0.067482 .  
## Market.Cap.ClassificationSmall Cap -6.208e+00  9.757e+00  -0.636 0.524986    
## Growth.vs.ValueValue                2.598e+01  1.192e+01   2.179 0.029927 *  
## I(EPS^2)                           -5.683e-01  1.576e-01  -3.605 0.000351 ***
## I(EPS^3)                            6.707e-03  2.479e-03   2.706 0.007097 ** 
## I(P.E.Ratio^2)                     -5.419e-03  6.533e-04  -8.295 1.56e-15 ***
## I(P.E.Ratio^3)                      3.705e-06  5.067e-07   7.313 1.38e-12 ***
## I(P.B.Ratio^2)                     -1.355e-01  2.956e-02  -4.585 6.02e-06 ***
## I(P.B.Ratio^3)                      6.500e-04  1.677e-04   3.875 0.000124 ***
## EPS:Growth.vs.ValueValue           -1.265e+01  1.116e+00 -11.337  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.94 on 412 degrees of freedom
## Multiple R-squared:  0.8221, Adjusted R-squared:  0.8165 
## F-statistic: 146.4 on 13 and 412 DF,  p-value: < 2.2e-16

Model Selection through AIC and BIC and Cross Validation

# Compare AIC and BIC across models
compare_models <- function(models) {
  results <- data.frame(
    model = names(models),
    AIC = sapply(models, AIC),
    BIC = sapply(models, BIC)
  )
  return(results[order(results$AIC), ])
}

# Example usage
models <- list(model_1 = initial_model, model_2 = model_2, model_3 = model_3, model_4 = model_4, model_5 = model_5)
compare_models(models)
##           model      AIC      BIC
## model_5 model_5 4726.391 4787.208
## model_4 model_4 4840.062 4896.824
## model_3 model_3 4954.199 4986.635
## model_2 model_2 4955.908 4992.397
## model_1 model_1 4958.026 4998.571
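The raw AIC values above can be made easier to interpret by converting them to Akaike weights, which express each model's relative support on a 0-to-1 scale. This is a sketch, not part of the original comparison, and assumes the `models` list defined above:

```r
# Sketch: delta-AIC and Akaike weights from the fitted models list.
# The best model has delta = 0; weights sum to 1 across candidates.
aics <- sapply(models, AIC)
delta <- aics - min(aics)
weights <- exp(-delta / 2) / sum(exp(-delta / 2))
round(data.frame(delta_AIC = delta, weight = weights), 4)
```

With a delta-AIC gap of over 100 between model_5 and model_4, essentially all of the weight lands on model_5, which matches the ranking in the table.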

Cross Validation

cross_validate_model <- function(model, data, model_name, k = 10) {
  control <- trainControl(method = "cv", number = k)
  cv_model <- train(formula(model), data = data, method = "lm", trControl = control)
  results <- cv_model$results
  results$model <- model_name
  return(results)
}

# Collect the results into a single data frame instead of printing each one
cv_results_list <- list()

cv_results_list[[1]] <- cross_validate_model(model_5, model_data, "Model 5")
cv_results_list[[2]] <- cross_validate_model(model_4, model_data, "Model 4")
cv_results_list[[3]] <- cross_validate_model(model_3, model_data, "Model 3")
cv_results_list[[4]] <- cross_validate_model(model_2, stock_data, "Model 2")
cv_results_list[[5]] <- cross_validate_model(initial_model, stock_data, "Initial Model")

cv_results_df <- do.call(rbind, cv_results_list)
print(cv_results_df)
##   intercept      RMSE  Rsquared      MAE   RMSESD RsquaredSD     MAESD
## 1      TRUE  71.20719 0.7751845 38.97920 48.52094  0.1937944 14.705386
## 2      TRUE 106.23586 0.6957015 48.33682 86.00764  0.1748311 16.597494
## 3      TRUE  79.50272 0.6970157 47.85845 22.89017  0.1436857  5.013735
## 4      TRUE  78.67322 0.7093004 48.23251 29.31212  0.1775683 11.923131
## 5      TRUE  78.81967 0.6738216 48.38239 28.70402  0.1688269 10.694216
##           model
## 1       Model 5
## 2       Model 4
## 3       Model 3
## 4       Model 2
## 5 Initial Model

I will be choosing model_5 and will proceed with it for all future tests and bootstrapping. This was expected, but I wanted to confirm that it is the best model: it has the lowest AIC and BIC while also having the lowest cross-validated MAE. Although its RMSE varies considerably across folds (note the high RMSESD), I am willing to overlook this because of its strong performance on R^2, AIC, and BIC. This does raise questions about overfitting, however, as the cross-validation results are non-deterministic.

## Testing/Bootstrapping of Model 5

### Multicollinearity Testing

vif_values <- vif(model_5, type = "predictor")
## GVIFs computed for predictors
print(vif_values)
##                               GVIF Df GVIF^(1/(2*Df))
## EPS                       2.349184  5        1.089160
## P.E.Ratio                 1.635818  3        1.085482
## P.B.Ratio                 1.351535  3        1.051489
## Market.Cap.Classification 1.527383  2        1.111698
## Growth.vs.Value           2.349184  5        1.089160
##                                                Interacts With
## EPS                       I(EPS^2), I(EPS^3), Growth.vs.Value
## P.E.Ratio                      I(P.E.Ratio^2), I(P.E.Ratio^3)
## P.B.Ratio                      I(P.B.Ratio^2), I(P.B.Ratio^3)
## Market.Cap.Classification                                --  
## Growth.vs.Value                                           EPS
##                                                                     Other Predictors
## EPS                                  P.E.Ratio, P.B.Ratio, Market.Cap.Classification
## P.E.Ratio                 EPS, P.B.Ratio, Market.Cap.Classification, Growth.vs.Value
## P.B.Ratio                 EPS, P.E.Ratio, Market.Cap.Classification, Growth.vs.Value
## Market.Cap.Classification                 EPS, P.E.Ratio, P.B.Ratio, Growth.vs.Value
## Growth.vs.Value                      P.E.Ratio, P.B.Ratio, Market.Cap.Classification
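
As a sanity check on the table above, the rule-of-thumb threshold comparison can be scripted. Note that for predictors with Df > 1, `GVIF^(1/(2*Df))` should be compared against the square root of the usual cutoff. A sketch, assuming the `vif_values` object just computed and a cutoff of 5:

```r
# Sketch: flag predictors whose adjusted GVIF exceeds the rule-of-thumb cutoff.
# For Df > 1, GVIF^(1/(2*Df)) is compared against sqrt(threshold), not threshold.
threshold <- 5
adj_gvif <- vif_values[, "GVIF^(1/(2*Df))"]
rownames(vif_values)[adj_gvif > sqrt(threshold)]  # expect character(0) here
```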

The VIF test shows no problematic multicollinearity among our predictor variables. The polynomial and interaction terms are of course correlated with their base variables, which is why we compute GVIFs with type = "predictor". None of the values exceed 5, so multicollinearity is not a concern.

### Marginal Effects Estimated

coefs <- coef(model_5)
marginal_effects <- coefs[-1]  # negative indexing drops the first element (the intercept)
barplot(
  marginal_effects,
  main = "Marginal Effects of Predictors",
  ylab = "Marginal Effects",
  las = 2,
  col = "skyblue",
  horiz = FALSE
)

effects_model_5 <- allEffects(model_5)
filtered_effects <- effects_model_5[!grepl("\\^", names(effects_model_5))]
# Plot each first-order effect individually
for (i in seq_along(filtered_effects)) {
  plot(filtered_effects[i])
}

print(data.frame(Variable = names(marginal_effects), MarginalEffect = marginal_effects))
##                                                              Variable
## EPS                                                               EPS
## P.E.Ratio                                                   P.E.Ratio
## P.B.Ratio                                                   P.B.Ratio
## Market.Cap.ClassificationMid Cap     Market.Cap.ClassificationMid Cap
## Market.Cap.ClassificationSmall Cap Market.Cap.ClassificationSmall Cap
## Growth.vs.ValueValue                             Growth.vs.ValueValue
## I(EPS^2)                                                     I(EPS^2)
## I(EPS^3)                                                     I(EPS^3)
## I(P.E.Ratio^2)                                         I(P.E.Ratio^2)
## I(P.E.Ratio^3)                                         I(P.E.Ratio^3)
## I(P.B.Ratio^2)                                         I(P.B.Ratio^2)
## I(P.B.Ratio^3)                                         I(P.B.Ratio^3)
## EPS:Growth.vs.ValueValue                     EPS:Growth.vs.ValueValue
##                                    MarginalEffect
## EPS                                  3.361210e+01
## P.E.Ratio                            1.901050e+00
## P.B.Ratio                            7.069327e+00
## Market.Cap.ClassificationMid Cap    -1.294353e+01
## Market.Cap.ClassificationSmall Cap  -6.207608e+00
## Growth.vs.ValueValue                 2.597568e+01
## I(EPS^2)                            -5.682798e-01
## I(EPS^3)                             6.706735e-03
## I(P.E.Ratio^2)                      -5.419157e-03
## I(P.E.Ratio^3)                       3.705277e-06
## I(P.B.Ratio^2)                      -1.355201e-01
## I(P.B.Ratio^3)                       6.499942e-04
## EPS:Growth.vs.ValueValue            -1.265159e+01

As the marginal effects from effects::allEffects show, all of the predictors are roughly linear except P.B.Ratio and P.E.Ratio, which display a smooth, non-linear relationship with Recent.Close.Price: their effect increases and decreases at a non-constant rate as they change, so a purely linear term would not capture the relationship. Since the model already includes quadratic and cubic terms for these variables, I am comfortable keeping them rather than excluding them. Additionally, we already tested a model without the linear versions of these terms and it had a worse BIC score, suggesting the current specification is justified.

### Bootstrapping

n_boot <- 1000  # Number of bootstrap samples
# we will create vectors to store all of the values that we need
alpha_b <- numeric(n_boot)  # Intercept
beta_b <- numeric(n_boot)   # Slope (the second coefficient, i.e. EPS)
d_alpha_b <- numeric(n_boot)  # Confidence interval width for intercept
d_beta_b <- numeric(n_boot)   # Confidence interval width for slope
bootstrap_r2 <- numeric(n_boot)  # Storage for R² values

# Perform bootstrapping
set.seed(123)  # For reproducibility
for (i in 1:n_boot) {
  # Resample data with replacement
  boot_indices <- sample(1:nrow(stock_data), nrow(stock_data), replace = TRUE)
  boot_sample <- stock_data[boot_indices, ]
  
  # Fit the OLS model on the bootstrap sample
  
  reg_result <- lm(interaction_formula, data = boot_sample)
  
  # Store the intercept and slope coefficients
  alpha_b[i] <- reg_result$coefficients[1]
  beta_b[i] <- reg_result$coefficients[2]
  
  # Calculate and store confidence intervals
  ci <- confint(reg_result, level = 0.95)
  d_alpha_b[i] <- ci[1, 2] - ci[1, 1]  # Width of confidence interval for intercept
  d_beta_b[i] <- ci[2, 2] - ci[2, 1]   # Width of confidence interval for slope
  
  # Store R²
  bootstrap_r2[i] <- summary(reg_result)$r.squared
}

# Set up plotting area
par(mfrow = c(2, 3))  # 2 rows, 3 columns for the plots

# Plot distribution of the intercept
hist(alpha_b, main = "Bootstrap Intercept", xlab = "Intercept", col = "lightblue", breaks = 30)

# Plot distribution of the slope
hist(beta_b, main = "Bootstrap Slope", xlab = "Slope", col = "lightgreen", breaks = 30)

# Plot distribution of confidence interval widths for intercept
hist(d_alpha_b, main = "CI Width for Intercept", xlab = "CI Width", col = "lightpink", breaks = 30)

# Plot distribution of confidence interval widths for slope
hist(d_beta_b, main = "CI Width for Slope", xlab = "CI Width", col = "lightyellow", breaks = 30)

# Plot distribution of R² values
hist(bootstrap_r2, main = "Bootstrap R²", xlab = "R²", col = "lightcoral", breaks = 30)

# Residuals plot for the original model
plot(fitted(model_5), resid(model_5), main = "Residuals vs Fitted", 
     xlab = "Fitted Values", ylab = "Residuals", col = "blue")
abline(h = 0, col = "red", lty = 2)

The bootstrapping results show stable estimates for the intercept, slope, and R^2, with narrow confidence intervals for both the slope and intercept. The model appears well-specified, with no major patterns in the residuals. With this I will conclude testing for our model and proceed to the Summary.
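The "narrow confidence intervals" claim can be made concrete with percentile intervals taken directly from the bootstrap draws. A sketch using the `alpha_b`, `beta_b`, and `bootstrap_r2` vectors filled in by the loop above:

```r
# Sketch: 95% percentile bootstrap intervals from the stored draws.
quantile(alpha_b, c(0.025, 0.975))       # intercept
quantile(beta_b, c(0.025, 0.975))        # EPS coefficient
quantile(bootstrap_r2, c(0.025, 0.975))  # R^2
```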

Summary

summary(model_5)
## 
## Call:
## lm(formula = interaction_formula, data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -174.16  -20.52    1.16   17.95  579.78 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -8.159e+01  1.225e+01  -6.660 8.81e-11 ***
## EPS                                 3.361e+01  2.372e+00  14.173  < 2e-16 ***
## P.E.Ratio                           1.901e+00  1.854e-01  10.253  < 2e-16 ***
## P.B.Ratio                           7.069e+00  1.163e+00   6.078 2.77e-09 ***
## Market.Cap.ClassificationMid Cap   -1.294e+01  7.060e+00  -1.833 0.067482 .  
## Market.Cap.ClassificationSmall Cap -6.208e+00  9.757e+00  -0.636 0.524986    
## Growth.vs.ValueValue                2.598e+01  1.192e+01   2.179 0.029927 *  
## I(EPS^2)                           -5.683e-01  1.576e-01  -3.605 0.000351 ***
## I(EPS^3)                            6.707e-03  2.479e-03   2.706 0.007097 ** 
## I(P.E.Ratio^2)                     -5.419e-03  6.533e-04  -8.295 1.56e-15 ***
## I(P.E.Ratio^3)                      3.705e-06  5.067e-07   7.313 1.38e-12 ***
## I(P.B.Ratio^2)                     -1.355e-01  2.956e-02  -4.585 6.02e-06 ***
## I(P.B.Ratio^3)                      6.500e-04  1.677e-04   3.875 0.000124 ***
## EPS:Growth.vs.ValueValue           -1.265e+01  1.116e+00 -11.337  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.94 on 412 degrees of freedom
## Multiple R-squared:  0.8221, Adjusted R-squared:  0.8165 
## F-statistic: 146.4 on 13 and 412 DF,  p-value: < 2.2e-16

The objective of this analysis was to develop a model that predicts stock prices based on various financial variables. The results show that EPS, P/E ratio, and P/B ratio are the most influential predictors of stock price, with higher values of these variables generally associated with higher prices. Near the baseline, a one-unit increase in EPS is associated with a rise of about 33.61 in the stock's recent close price, while higher P/E and P/B ratios also push the price up, albeit at a diminishing rate because of the quadratic and cubic terms for these predictors. This suggests that extreme values of EPS and the P/E ratio yield diminishing returns: higher profitability and market valuation are not linearly related to stock price increases. The Growth vs. Value classification also plays a role. The value-stock indicator carries a higher baseline price, but the EPS:Growth.vs.Value interaction shows that the positive effect of EPS on price is substantially weaker for value stocks, so growth stocks command higher predicted prices for all but the lowest-EPS firms, suggesting investors tend to value growth potential more than current book value. It was interesting to note that market cap was not as important as I had assumed; I expected a small-cap stock to automatically imply a price below ~$40 or some other arbitrary cutoff. The model explains 82% of the variance in stock prices, and the residuals indicate a good fit. I do want to note, however, that this model is probably overfitted, and further cross-validation should be conducted before anyone uses it in the field to inform stock purchases. We also note from the earlier Cook's distance plots that a number of outliers are present, but we decided to keep them in the final model.
It is also interesting that the quadratic terms have negative coefficients while the cubic terms have positive ones. This alternating-sign pattern is typical of polynomial fits: each higher-order term corrects the curvature left by the lower-order ones at the extremes of the predictor's range, and the shrinking magnitudes (e.g. 33.61, -0.57, 0.0067 for the EPS terms) mean the linear term still dominates over most of the observed range.
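
One way to see the diminishing-returns behavior concretely is to differentiate the fitted EPS polynomial. A sketch using the coefficients from the summary above (growth stocks only, so the EPS:Value interaction is not included):

```r
# Sketch: marginal effect of EPS = d(price)/d(EPS) for the cubic fit.
# Coefficients hard-coded from the model_5 summary for illustration.
eps_effect <- function(eps) 33.61 - 2 * 0.5683 * eps + 3 * 0.006707 * eps^2
eps_effect(c(0, 10, 25))  # effect shrinks as EPS grows, then curls back up
```

This is why the 33.61 figure should be read as the effect near EPS = 0 rather than a constant slope.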